About


“Copy/Paste is the mother of learning.”
“Repetition! Repetition is the mother of learning.”

Sources: GitHub | Google Drive | OneDrive

Environment

Assumption: Working directory has sub-folders named "data", "images", "code", "docs".

R Version

# #R Version
R.version.string
## [1] "R version 4.1.2 (2021-11-01)"

Working Directory

# #Working Directory
getwd()
## [1] "D:/Analytics/xADSM"

Session Info

# #Version information about R, the OS and attached or loaded packages
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252    LC_MONETARY=English_India.1252
## [4] LC_NUMERIC=C                   LC_TIME=English_India.1252    
## 
## attached base packages:
## [1] compiler  grid      stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] stringi_1.7.5        mlbench_2.1-3        glmnet_4.1-3         Matrix_1.3-4        
##  [5] caret_6.0-90         lattice_0.20-45      RColorBrewer_1.1-2   fastDummies_1.6.3   
##  [9] psych_2.1.9          scales_1.1.1         viridisLite_0.4.0    corrplot_0.92       
## [13] GGally_2.1.2         microbenchmark_1.4.9 ggpmisc_0.4.4        ggpp_0.4.2          
## [17] qcc_2.7              VIM_6.1.1            colorspace_2.0-2     mice_3.13.0         
## [21] nortest_1.0-4        Lahman_9.0-0         gapminder_0.3.0      nycflights13_1.0.2  
## [25] gifski_1.4.3-1       data.table_1.14.2    zoo_1.8-9            car_3.0-12          
## [29] carData_3.0-4        lubridate_1.8.0      e1071_1.7-9          latex2exp_0.5.0     
## [33] readxl_1.3.1         kableExtra_1.3.4     forcats_0.5.1        stringr_1.4.0       
## [37] dplyr_1.0.7          purrr_0.3.4          readr_2.1.0          tidyr_1.1.4         
## [41] tibble_3.1.6         ggplot2_3.3.5        conflicted_1.0.4    
## 
## loaded via a namespace (and not attached):
##  [1] ellipsis_0.3.2       class_7.3-19         rstudioapi_0.13      proxy_0.4-26        
##  [5] farver_2.1.0         listenv_0.8.0        MatrixModels_0.5-0   prodlim_2019.11.13  
##  [9] fansi_0.5.0          ranger_0.13.1        xml2_1.3.2           codetools_0.2-18    
## [13] splines_4.1.2        mnormt_2.0.2         cachem_1.0.6         robustbase_0.93-9   
## [17] knitr_1.36           jsonlite_1.7.2       pROC_1.18.0          broom_0.7.10        
## [21] httr_1.4.2           backports_1.3.0      assertthat_0.2.1     fastmap_1.1.0       
## [25] htmltools_0.5.2      quantreg_5.86        tools_4.1.2          gtable_0.3.0        
## [29] glue_1.5.0           reshape2_1.4.4       Rcpp_1.0.7           cellranger_1.1.0    
## [33] jquerylib_0.1.4      vctrs_0.3.8          svglite_2.0.0        nlme_3.1-153        
## [37] conquer_1.2.1        iterators_1.0.13     lmtest_0.9-39        timeDate_3043.102   
## [41] xfun_0.28            gower_0.2.2          laeken_0.5.2         globals_0.14.0      
## [45] rvest_1.0.2          lifecycle_1.0.1      future_1.23.0        DEoptimR_1.0-9      
## [49] MASS_7.3-54          ipred_0.9-12         hms_1.1.1            parallel_4.1.2      
## [53] SparseM_1.81         yaml_2.2.1           sass_0.4.0           rpart_4.1-15        
## [57] reshape_0.8.8        foreach_1.5.1        boot_1.3-28          shape_1.4.6         
## [61] lava_1.6.10          matrixStats_0.61.0   rlang_0.4.12         pkgconfig_2.0.3     
## [65] systemfonts_1.0.3    evaluate_0.14        recipes_0.1.17       tidyselect_1.1.1    
## [69] parallelly_1.29.0    plyr_1.8.6           magrittr_2.0.1       bookdown_0.24       
## [73] R6_2.5.1             generics_0.1.1       DBI_1.1.1            pillar_1.6.4        
## [77] withr_2.4.2          survival_3.2-13      abind_1.4-5          sp_1.4-6            
## [81] nnet_7.3-16          future.apply_1.8.1   crayon_1.4.2         utf8_1.2.2          
## [85] tmvnsim_1.0-2        tzdb_0.2.0           rmarkdown_2.11       ModelMetrics_1.2.2.2
## [89] vcd_1.4-9            digest_0.6.28        webshot_0.5.2        stats4_4.1.2        
## [93] munsell_0.5.0        bslib_0.3.1

Pandoc

# #Pandoc Version being used by RStudio
rmarkdown::pandoc_version()
## [1] '2.14.0.3'

Aside

I wanted a single document containing notes, code, and output as a quick reference for the lectures. Juggling multiple file formats (docx, csv, xlsx, R, png, etc.) was not working out for me, so I used the bookdown package to generate this HTML file.

All of us have had to stumble through the most common problems individually; as we approach deeper topics, a more collaborative approach might be more beneficial.

Further, the lectures are highly focused, so I had to explore some side topics in more detail to get the most benefit from them. I have included those topics, and I am also interested in hearing about your experiences.

Towards that goal, I am sharing these notes, hoping that you will run the code in your own environment and raise any queries, problems, or differences in outcomes. Any suggestion or criticism is welcome. I have tried not to make any significant changes to your working environment; please let me know if you observe otherwise.

Currently, my priority is to get in sync with the ongoing lectures. The time constraint has led to the issues given below, which will be corrected as and when possible.

  • The tone of the document may be a little abrupt; please overlook that.
  • Source references (for where I copied or learned from) are not added as much as I wanted, and there is no easy solution for this.
  • I have NOT explained some functions before their usage (lapply(), identical(), etc.). Hyperlinks will be added as and when those topics are covered.
  • The code has been checked only on Windows 10. For Mac or Linux, if you find different output or behaviour, please let me know.
  • Although these notes are generated using R Markdown and bookdown, I have not yet covered those topics. If you need any help creating your own notes, please let me know; if I have a solution for your problem, I will share it.

Last but not least, I am also learning while creating this, so if you think I am wrong somewhere, please point it out. I am always open to suggestions.

Thank You all for the encouragement.

Shivam


(B01)


(B02)


(B03)


(B04)


(B05)


(B06)


(B07)


(B08)


1 R Introduction (B09, Aug-31)

1.1 R Basics

R is case-sensitive, e.g. c() not C(), and View() not view().

The hash sign “#” comments out anything after it, till the newline. There are no multiline comments.

The backslash “\” is reserved to escape the character that follows it.

The Escape key stops the parser, e.g. when the console shows a “+” prompt because R is waiting for more input before evaluating.
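These basics can be checked directly; a minimal sketch, safe to run in any session:

```r
# #Case sensitivity: 'c' and 'C' name different objects (c() combines; stats::C() sets contrasts)
exists("c")
exists("C")
#
# #Hash "#" comments out the rest of the line; R has no multi-line comment syntax
1 + 1 # everything after the hash is ignored
#
# #Backslash escapes the character after it; write "\\" for one literal backslash
cat("Tab:\tNewline:\nBackslash: \\\n")
nchar("\\") # the source text "\\" is a single backslash character
```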

Overview

1.1.1 R Studio

  • There are 4 Panes -
    1. Top Left - R Editor, Source
    2. Bottom Left - Console, Terminal, …
    3. Top Right - Environment, History, …
    4. Bottom Right - Plots, …
  • Sometimes there are only 3 panes, i.e. the Editor Pane is missing
    • To Open Editor Pane - Create a New R Script by “File | New File | R Script” or "Ctrl+ Shift+ N"
  • To Modify Pane Settings - Tools | Global Options | Pane Layout

1.1.2 Shortcuts

  • Execute the current expression in Source Pane (Top): “Ctrl+ Enter”
  • Execute the current expression in Console Pane (Bottom): “Enter”
  • Clear the Console Pane (Bottom): “Ctrl+ L”
  • Restart the Current R Session: “Ctrl+ Shift+ F10”
  • Create a New R Script: “Ctrl+ Shift+ N”
  • Insert ” <- ” i.e. Assignment Operator with Space: “Alt+ -”
  • Insert ” %>% ” i.e. Pipe Operator with Space: “Ctrl+ Shift+ M”
  • Comment or Uncomment Lines: “Ctrl+ Shift+ C”
  • Set Working Directory: “Ctrl+ Shift+ H”
  • Search Command History: “Ctrl+ Up Arrow”
  • Search Files: “Ctrl+ .”

1.1.3 Executing an Expression

Execute the current expression in Source Pane (Top) by ‘Run’ Button or "Ctrl+ Enter"

Execute the current expression in Console Pane (Bottom) by “Enter”

1.1.4 PATH and Working Directory

Windows 10 uses the backslash “\” in paths. R, however, uses the forward slash “/”, because the backslash “\” is R’s escape character.

  • So, to provide “C:\Users\userName\Documents” as a path
    • Use: “C:\\Users\\userName\\Documents”
    • OR: “C:/Users/userName/Documents”
    • OR: “~” — the tilde acts as a reference to the home directory
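A small sketch of equivalent path spellings (the user name is hypothetical):

```r
# #The same Windows path, written two ways
p_backslash <- "C:\\Users\\userName\\Documents" # escaped backslashes
p_slash     <- "C:/Users/userName/Documents"    # forward slashes
#
# #Swapping separators shows they describe the same path
identical(gsub("/", "\\", p_slash, fixed = TRUE), p_backslash)
#
# #file.path() joins path components portably; path.expand() resolves "~"
file.path("data", "aa.xlsx")
path.expand("~") # the home directory (machine-specific)
```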

In R Studio, Set Working Directory by:

  • Session | Set Working Directory | Choose Directory or "Ctrl+ Shift+ H"
# #Current Working Directory
getwd()
## [1] "D:/Analytics/xADSM"
#
# #R Installation Directory (Old DOS Convention i.e. ~1 after 6 letters)
R.home()
## [1] "C:/PROGRA~1/R/R-41~1.2"
Sys.getenv("R_HOME") 
## [1] "C:/PROGRA~1/R/R-41~1.2"
#
# #This is Wrapped in IF Block to prevent accidental execution
if(FALSE){
# #WARNING: This will change your Working Directory
  setwd("~")
}

1.1.5 Printing

If R code is entered at the console, line by line, then the result of each expression is printed automatically, i.e. no function is needed for printing. This is called implicit printing.

When an R script file is executed (e.g. via source()), implicit printing does not work and expressions need to be printed explicitly.

In R, the most common method to print the output ‘explicitly’ is by the function print().

# #Implicit Printing: This will NOT be printed to Console, if it is inside an R Script.
"Hello World!"
#
# #Implicit Printing using '()': Same as above
("Hello World!")
#
# #Explicit Printing using print() : To print Objects to Console, even inside an R Script.
print("Hello World!")
## [1] "Hello World!"

1.2 Objects

1.2.1 List ALL Objects

Everything that exists in R is an object in the sense that it is a kind of data structure that can be manipulated. Expressions for evaluation are themselves objects; Evaluation consists of taking the object representing an expression and returning the object that is the value of that expression.

# #ls(): List ALL Objects in the Current NameSpace (Environment)
ls()
## character(0)

1.2.2 Assign a Value to an Object

Caution: Always use “<-” for assignment, NOT “=”

While “=” can be used for assignment, its use for assignment is highly discouraged because it behaves differently in certain subtle situations that are difficult to debug. The convention is to use “=” only inside function calls, for argument association (as a syntactic token).

There are 5 assignment operators (<-, =, <<-, ->, ->>); the others are not discussed for now.
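One of those subtle situations can be shown in a short sketch (run in a fresh session, so that no object ‘x’ exists beforehand):

```r
# #"=" here only names the argument: no object 'x' is created in the workspace
median(x = 1:5)
exists("x") # FALSE in a fresh session
#
# #"<-" inside a call passes the value AND creates 'x' in the calling environment
median(x <- 1:5)
exists("x") # TRUE now
rm(x)
```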

All the created objects are listed in the Environment Tab of the Top Right Pane.

# #Assignment Operator "<-" is used to assign any value (ex: 10) to any object (ex: 'bb')
bb <- 10
#
# #Print Object
print(bb)
## [1] 10

1.2.3 Remove an Object

In the Environment Tab, any object can be selected and deleted using the broom (clear) icon.

# #Trying to Print an Object 'bb' and Handling the Error, if thrown
tryCatch(print(bb), error = function(e) print(paste0(e)))
## [1] 10
#
# #Remove an Object
rm(bb)
#
# #Equivalent
if(FALSE) {rm("bb")} #Same
if(FALSE) {rm(list = "bb")} #More explicit; would not work without quotes
#
# #Trying to Print an Object 'bb' and Handling the Error, if thrown
tryCatch(print(bb), error = function(e) print(paste0(e)))
## [1] "Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'print': object 'bb' not found\n"

1.3 Data

16.1 Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.

16.2 Elements are the entities on which data are collected. (Generally ROWS)

16.3 A variable is a characteristic of interest for the elements. (Generally COLUMNS)

16.4 The set of measurements obtained for a particular element is called an observation.

16.5 Statistics is the art and science of collecting, analysing, presenting, and interpreting data.

1.4 Vectors

R has 6 basic data types (logical, integer, double, character, complex, and raw). These data types can be combined to form Data Structures (vector, list, matrix, dataframe, factor etc.). Refer What is a Vector!

Definition 1.1 Vectors are the simplest type of data structure in R. A vector is a sequence of data elements of the same basic type.
Definition 1.2 Members of a vector are called components.

Atomic vectors are homogeneous i.e. each component has the same datatype. A vector type can be checked with the typeof() or class() function. Its length, i.e. the number of elements in the vector, can be checked with the function length().

If the printed output of an expression does not show an index in brackets like ‘[1],’ then the expression did not return a vector for auto-printing. The bracketed [numbers] show that a vector is being printed. Ex: str() and cat() write their output directly to the console and return NULL (invisibly).
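Both the type-inspection functions and the NULL-return behaviour can be verified directly; a minimal sketch:

```r
# #typeof() gives the internal storage type; class() is what generic functions see
v <- c(1.5, 2.5)
typeof(v)  # "double"
class(v)   # "numeric"
length(v)  # 2
#
# #cat() and str() write to the console but return NULL (invisibly)
is.null(cat("Hello\n")) # TRUE
is.null(str(v))         # TRUE
```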

Use function c() to create a vector (or a list) -

  • In R, a literal character or number is just a vector of length 1; c() ‘combines’ a series of such length-1 vectors together.
  • c() neither creates nor concatenates vectors; it combines them. Thus, it combines lists into a list and vectors into a vector.
  • In R, list is a ‘Vector’ but not an ‘Atomic Vector.’
  • All arguments are coerced to a common type which is the type of the returned value.
  • All attributes (e.g. dim) except ‘names’ are removed.
  • The output type is determined from the highest type of the components in the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression.
  • To “index a vector” means to address specific elements by using square brackets, i.e. x[10] means the 10th element of vector ‘x.’
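A sketch of the coercion hierarchy and bracket indexing described above:

```r
# #One "higher" component lifts the whole result up the hierarchy
typeof(c(TRUE, 1L))  # "integer"
typeof(c(1L, 2.5))   # "double"
typeof(c(1, "a"))    # "character"
#
# #Combining with a list gives a list (a vector, but not an atomic vector)
str(c(list(1), 2))
#
# #Indexing with square brackets
x <- letters[1:10]
x[10] # "j", the 10th element
```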

Caution: The colon “:” can produce a vector of unexpected length when an endpoint is 0: “1:n” with n = 0 yields c(1, 0), not an empty vector. Suggestion: use the colon only with hardcoded numbers, i.e. “1:10” is OK, but “1:n” is dangerous and should be avoided.

Caution: The seq() function can produce a vector of unexpected type (e.g. double instead of integer, depending on the argument types). Suggestion: use seq_along() or seq_len().
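The 0-length trap behind the colon caution, in a short sketch:

```r
# #With n equal to 0, ":" counts DOWN instead of producing an empty vector
n <- 0L
1:n # c(1L, 0L): two elements, usually not what a loop expects
#
# #The safe alternatives return a genuinely empty sequence
seq_len(n)              # integer(0)
seq_along(character(0)) # integer(0)
```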

Atomic Vectors

# #To know about an Object: str(), class(), length(), dim(), typeof(), is(), attributes(), names()
# #Integer: To declare as integer "L" (NOT "l") is needed
ii_int <- c(1L, 2L, 3L, 4L, 5L)
str(ii_int)
##  int [1:5] 1 2 3 4 5
#
# #Double (& Default)
dd_dbl <- c(1, 2, 3, 4, 5)
str(dd_dbl)
##  num [1:5] 1 2 3 4 5
#
# #Character
cc_chr <- c('a', 'b', 'c', 'd', 'e')
str(cc_chr)
##  chr [1:5] "a" "b" "c" "d" "e"
#
# #Logical
ll_lgl <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
str(ll_lgl)
##  logi [1:5] TRUE FALSE FALSE TRUE TRUE

Integer

# #Integer Vector of Length 1
nn <- 5L
#
# #Colon ":" Operator - Avoid its usage
str(c(1:nn))
##  int [1:5] 1 2 3 4 5
c(typeof(pi:6), typeof(6:pi))
## [1] "double"  "integer"
#
# #seq() - Avoid its usage
str(seq(1, nn))
##  int [1:5] 1 2 3 4 5
str(seq(1, nn, 1))
##  num [1:5] 1 2 3 4 5
str(seq(1, nn, 1L))
##  num [1:5] 1 2 3 4 5
str(seq(1L, nn, 1L))
##  int [1:5] 1 2 3 4 5
#
# #seq_len()
str(seq_len(nn))
##  int [1:5] 1 2 3 4 5

Double

str(seq(1, 5, 1))
##  num [1:5] 1 2 3 4 5

Character

str(letters[1:5])
##  chr [1:5] "a" "b" "c" "d" "e"

Logical

str(1:5 %% 2 == 0)
##  logi [1:5] FALSE TRUE FALSE TRUE FALSE

1.5 DataFrame

# #Create Two Vectors
income <- c(100, 200, 300, 400, 500)
gender <- c("male", "female", "female", "female", "male")
#
# #Create a DataFrame
bb <- data.frame(income, gender)
#
# #Print or View DataFrame
#View(bb)
print(bb)
##   income gender
## 1    100   male
## 2    200 female
## 3    300 female
## 4    400 female
## 5    500   male
#
# #Structure
str(bb)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
#
# #Names
names(bb)
## [1] "income" "gender"

1.6 Save and Load an R Script

R Script file extension is “.R”

"Ctrl+ S" will Open Save Window at Working Directory.

"Ctrl+ O" will Open the Browse Window at Working Directory.

Check File Exist

# #Subdirectory "data" has data files like .csv .rds .txt .xlsx
# #Subdirectory "code" has scripts files like .R 
# #Subdirectory "images" has images like .png
#
# #Check if a File exists 
path_relative <- "data/aa.xlsx" #Relative Path
#
if(file.exists(path_relative)) {
    cat("File Exists\n") 
  } else {
    cat(paste0("File does not exist at ", getwd(), "/", path_relative, "\n"))
  }
## File Exists
#
if(exists("XL", envir = .z)) {
  cat(paste0("Absolute Path exists as: ", .z$XL, "\n"))
  path_absolute <- paste0(.z$XL, "aa", ".xlsx") #Absolute Path
  #
  if(file.exists(path_absolute)) {
    cat("File Exists\n") 
  } else {
    cat(paste0("File does not exist at ", path_absolute, "\n"))
  }
} else {
  cat(paste0("Object 'XL' inside Hidden Environment '.z' does not exist. \n", 
             "It is probably File Path of the Author, Replace the File Path from Your own Directory\n"))
}
## Absolute Path exists as: D:/Analytics/xADSM/data/
## File Exists

Aside

  • This section is NOT useful for the general reader and can be safely ignored. It contains my notes on building this book, useful only for someone building their own book. (Shivam)
  • An “absolute path” is NOT a problem when building the book, knitting a chapter, or working directly on the console.
  • An “absolute path” is a problem only when running a code chunk directly from an Rmd document that sits inside a sub-directory (as in this book); only then does the working directory differ.

1.7 CSV Import /Export

write.csv() and read.csv() combination can be used to export data and import it back into R. But, it has some limitations -

  • Re-imported object “yy_data” will NOT match with the original object “xx_data” under default conditions
    1. write.csv(), by default, write row.names (or row numbers) in the first column.
      • So, either use row.names = FALSE while writing
      • OR use row.names = 1 while reading
    2. row.names attribute is always read as ‘character’ even though originally it might be ‘integer.’
      • So, that attribute needs to be coerced
    3. The colClasses argument needs to be specified to match the original dataframe; otherwise ‘income’ is read as ‘integer,’ even though originally it was ‘numeric.’
    4. Conclusion: Avoid, if possible.
  • Alternative: saveRDS() and readRDS()
    • Functions to write a single R object to a file, and to restore it.
    • Imported /Exported objects are always identical
ERROR 1.1 Error in file(file, ifelse(append, "a", "w")) : cannot open the connection
  • Check the path, file name, & file extension for typing mistakes
  • Execute getwd(), just before the command, to confirm that the working directory is as expected

write.csv()

str(bb)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
#
xx_data <- bb
#
# #Write a dataframe to a CSV File
write.csv(xx_data, "data/B09_xx_data.csv")
#
# #Read from the CSV into a dataframe
yy_data <- read.csv("data/B09_xx_data.csv")
#
# #Check if the object read back is the same as the object that was written
identical(xx_data, yy_data)
## [1] FALSE

Match Objects

# #Exercise to show how to match the objects being imported /exported from CSV
str(bb)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
xx_data <- bb
# #Write to CSV
write.csv(xx_data, "data/B09_xx_data.csv")
#
# #Read from CSV by providing row.names Column and colClasses()
yy_data <- read.csv("data/B09_xx_data.csv", row.names = 1,
                    colClasses = c('character', 'numeric', 'character'))
#
# #Coerce row.names attribute to integer
attr(yy_data, "row.names") <- as.integer(attr(yy_data, "row.names"))
#
# #Check if the objects are identical
identical(xx_data, yy_data)
## [1] TRUE
stopifnot(identical(xx_data, yy_data))

RDS

str(bb)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
xx_data <- bb
#
# #Save the Object as RDS File
saveRDS(xx_data, "data/B09_xx_data.rds")
#
# #Read from the RDS File
yy_data <- readRDS("data/B09_xx_data.rds")
#
# #Objects are identical (No additional transformations are needed)
identical(xx_data, yy_data)
## [1] TRUE

1.8 Modify Dataframe

str(xx_data)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
# #Adding a Column to a dataframe
xx_data <- data.frame(xx_data, age = 22:26)
#
# #Adding a Column to a dataframe by adding a Vector
x_age <- 22:26
xx_data <- data.frame(xx_data, x_age)
str(xx_data)
## 'data.frame':    5 obs. of  4 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
##  $ age   : int  22 23 24 25 26
##  $ x_age : int  22 23 24 25 26
#
# #Adding a Column to a dataframe by using dollar "$"
xx_data$age1 <- x_age
#
# #Adding a Blank Column using NA
xx_data$blank <- NA
#
# #Editing of a dataframe can also be done
# edit(xx_data)
str(xx_data)
## 'data.frame':    5 obs. of  6 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
##  $ age   : int  22 23 24 25 26
##  $ x_age : int  22 23 24 25 26
##  $ age1  : int  22 23 24 25 26
##  $ blank : logi  NA NA NA NA NA
#
# #Removing a Column by subsetting
xx_data <- xx_data[ , -c(3)]
#
# #Removing a Column using NULL
xx_data$age1 <- NULL
str(xx_data)
## 'data.frame':    5 obs. of  4 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
##  $ x_age : int  22 23 24 25 26
##  $ blank : logi  NA NA NA NA NA

1.9 Packages

Definition 1.3 Packages are the fundamental units of reproducible R code.

Packages include reusable functions, the documentation that describes how to use them, and sample data.

In R Studio: Packages Tab | Install | Package Name = “psych” | Install

  • Packages are installed from CRAN Servers
    • To Change Server: Tools | Global Options | Packages | Primary CRAN Repository | Change | CRAN Mirrors (Select Your Preference) | OK
    • All Installed Packages are listed under Packages Tab
    • All Loaded Packages are listed under Packages Tab with a Tick Mark
    • Some packages are dependent on other packages and those are also installed when ‘dependencies = TRUE’
    • If a package is NOT installed properly, it will show error when loaded by library()
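A common guarded pattern is to install only the packages that are missing. A sketch, wrapped in an if(FALSE) block (like the install code in this book) to prevent accidental execution:

```r
if(FALSE){
  # #WARNING: This may install packages
  # #Install only the packages that are not already installed
  pkgs <- c("psych")
  missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs, dependencies = TRUE)
  }
}
```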

Install Packages

if(FALSE){
  # #WARNING: This will install packages and R Studio will NOT work for that duration
  # #Install Packages and their dependencies
  install.packages("psych", dependencies = TRUE)
}

Load Packages

# #Load a Package with or without Quotes
library(readxl)
library("readr")

Load Multiple Packages

# #Load Multiple Packages
pkg_chr <- c("ggplot2", "tibble", "tidyr", "readr", "dplyr")
#lapply(pkg_chr, FUN = function(x) {library(x, character.only = TRUE)})
#
# #Load Multiple Packages, Suppress Startup Messages, and No console output
invisible(lapply(pkg_chr, FUN = function(x) {
  suppressMessages(library(x, character.only = TRUE))}))

Detach Package

# #Detach a package
#detach("package:psych", unload = TRUE)
#
# #Search Package in the already loaded packages
pkg_chr <- "psych"
if (pkg_chr %in% .packages()) {
# #Detach a package that has been loaded previously
  detach(paste0("package:", pkg_chr), character.only = TRUE, unload = TRUE)
}

1.10 Import Flights Data

To Import Excel in R Studio : Environment | Dropdown | From Excel | Browse

The object imported by read.csv(), i.e. ‘mydata,’ is NOT the same as the one imported by read_excel(), i.e. ‘mydata_xl’

  • read_excel() imports as a tibble, which is a modern take on the dataframe. It is more restrictive, so that the output is more predictable.
  • read.csv() imports as integer where possible (ex: the ‘year’ column), whereas read_excel() imports as numeric where possible.
  • Further, read_excel() has imported many columns as ‘character’ that should have been ‘numeric,’ ex: dep_time
  • NOTE: To complete the set, readr::read_csv() is also covered here; it reads a CSV and generates a tibble.

All of these objects can be converted into any other form as needed i.e. dataframe to tibble or vice-versa.
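A sketch of that round trip, using a throwaway dataframe (assumes the tibble package is installed):

```r
# #dataframe -> tibble -> dataframe
df <- data.frame(income = c(100, 200), gender = c("male", "female"))
tb <- tibble::as_tibble(df)
class(tb)                # "tbl_df" "tbl" "data.frame"
df2 <- as.data.frame(tb)
class(df2)               # "data.frame"
all.equal(df, df2)       # TRUE: the data survives the round trip
```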

Flights

# #Data File Name has been modified to include lecture number "B09"
# #All Data Files are in the sub-directory named 'data'
mydata <- read.csv("data/B09-FLIGHTS.csv")
#
# #To Copy from Clipboard, assuming copied from xlsx i.e. tab separated data
mydata_clip <- read.csv("clipboard", sep = '\t', header = TRUE)

RDS

# #The following setup reads the CSV only once and then creates an RDS file
# #Its advantage is faster loading time and lower memory requirement
xx_csv <- paste0("data/", "B09-FLIGHTS", ".csv")
xx_rds <- paste0("data/", "b09_flights", ".rds")
b09_flights <- NULL
if(file.exists(xx_rds)) {
  b09_flights <- readRDS(xx_rds)
} else {
  # #Read CSV
  b09_flights <- read.csv(xx_csv)
  # #Write Object as RDS
  saveRDS(b09_flights, xx_rds)
}
rm(xx_csv, xx_rds)
mydata <- b09_flights

Structure

str(mydata)
## 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : int  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : int  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : int  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : int  1400 1416 1089 1576 762 719 1065 229 944 733 ...
##  $ hour          : int  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : int  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : chr  "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

Tail

tail(mydata)
##        year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## 336771 2013     9  30       NA           1842        NA       NA           2019        NA      EV
## 336772 2013     9  30       NA           1455        NA       NA           1634        NA      9E
## 336773 2013     9  30       NA           2200        NA       NA           2312        NA      9E
## 336774 2013     9  30       NA           1210        NA       NA           1330        NA      MQ
## 336775 2013     9  30       NA           1159        NA       NA           1344        NA      MQ
## 336776 2013     9  30       NA            840        NA       NA           1020        NA      MQ
##        flight tailnum origin dest air_time distance hour minute           time_hour
## 336771   5274  N740EV    LGA  BNA       NA      764   18     42 2013-09-30 18:00:00
## 336772   3393    <NA>    JFK  DCA       NA      213   14     55 2013-09-30 14:00:00
## 336773   3525    <NA>    LGA  SYR       NA      198   22      0 2013-09-30 22:00:00
## 336774   3461  N535MQ    LGA  BNA       NA      764   12     10 2013-09-30 12:00:00
## 336775   3572  N511MQ    LGA  CLE       NA      419   11     59 2013-09-30 11:00:00
## 336776   3531  N839MQ    LGA  RDU       NA      431    8     40 2013-09-30 08:00:00

Excel

# #library(readxl)
mydata_xl <- read_excel("data/B09-FLIGHTS.xlsx", sheet = "FLIGHTS")

Excel RDS

# #library(readxl)
xx_xl <- paste0("data/", "B09-FLIGHTS", ".xlsx")
xx_rds_xl <- paste0("data/", "b09_flights_xls", ".rds")
b09_flights_xls <- NULL
if(file.exists(xx_rds_xl)) {
  b09_flights_xls <- readRDS(xx_rds_xl)
} else {
  b09_flights_xls <- read_excel(xx_xl, sheet = "FLIGHTS")
  saveRDS(b09_flights_xls, xx_rds_xl)
}
rm(xx_xl, xx_rds_xl)
mydata_xl <- b09_flights_xls
#

xlsx

str(mydata_xl)
## tibble [336,776 x 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : num [1:336776] 2013 2013 2013 2013 2013 ...
##  $ month         : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : chr [1:336776] "517" "533" "542" "544" ...
##  $ sched_dep_time: num [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : chr [1:336776] "2" "4" "2" "-1" ...
##  $ arr_time      : chr [1:336776] "830" "850" "923" "1004" ...
##  $ sched_arr_time: num [1:336776] 819 830 850 1022 837 ...
##  $ arr_delay     : chr [1:336776] "11" "20" "33" "-18" ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : num [1:336776] 1545 1714 1141 725 461 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : chr [1:336776] "227" "227" "160" "183" ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

readr

# #The following setup reads the CSV only once and then creates an RDS file
# #Its advantage is faster loading time and lower memory requirement
# #library(readr)
xx_csv <- paste0("data/", "B09-FLIGHTS", ".csv")
xx_rds <- paste0("data/", "xxflights", ".rds")
xxflights <- NULL
if(file.exists(xx_rds)) {
  xxflights <- readRDS(xx_rds)
} else {
  xxflights <- read_csv(xx_csv, show_col_types = FALSE)
  attr(xxflights, "spec") <- NULL
  attr(xxflights, "problems") <- NULL
  saveRDS(xxflights, xx_rds)
}
rm(xx_csv, xx_rds)
mydata_rdr <- xxflights

1.11 Subsetting

# #Subset All Rows and last 3 columns
data6 <- mydata[ , c(17:19)]
str(data6)
## 'data.frame':    336776 obs. of  3 variables:
##  $ hour     : int  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute   : int  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour: chr  "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# #Subset by deleting the 1:16 columns
data7 <- mydata[ , -c(1:16)]
stopifnot(identical(data6, data7))

1.12 Attach a Dataset

Caution: Attaching a dataset should be avoided, to prevent unexpected behaviour due to ‘masking.’ Using full scope resolution, i.e. ‘data_frame$column_header,’ results in fewer bugs. However, if a dataset has been attached, please ensure that it is also detached.

Caution: If a dataset is attached more than once, e.g. 4 times, note that there will be 4 copies attached to the environment. This can be checked with search(). Each copy needs to be detached.
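A small self-contained sketch of the “multiple copies” behaviour, using a throwaway dataframe rather than the flights data:

```r
# #Attach the same dataframe twice: two copies land on the search path
df_demo <- data.frame(v = 1:3)
attach(df_demo)
attach(df_demo) # the second attach prints a masking message
sum(search() == "df_demo") # 2
#
# #Detach until no copy remains
while ("df_demo" %in% search()) detach("df_demo")
sum(search() == "df_demo") # 0
rm(df_demo)
```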

if(FALSE){
  # #WARNING: Attaching a Dataset is discouraged because of 'masking'
  # #'dep_time' is Column Header of a dataframe 'mydata'
  tryCatch(str(dep_time), error = function(e) print(paste0(e)))
## [1] "Error in str(dep_time): object 'dep_time' not found\n"
  # #Attach the Dataset
  attach(mydata)
  # #Now all the column headers are accessible without the $ sign
  str(dep_time)
## int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
  # #But, there are other datasets also, attaching another one results in MESSAGE
  attach(mydata_xl)
## The following objects are masked from mydata:
##
##     air_time, arr_delay, arr_time, carrier, day, dep_delay, dep_time, dest,
##     distance, flight, hour, minute, month, origin, sched_arr_time,
##     sched_dep_time, tailnum, time_hour, year
  str(dep_time)
## chr [1:336776] "517" "533" "542" "544" "554" "554" "555" "557" "557" "558" "558" ...
#
# #'mydata_xl$dep_time' masked the already present 'mydata$dep_time'.
# #Thus it now shows as 'chr' in place of the original 'int'
# #Column header names vary widely and will silently mask other variables
# #Hence, attaching a dataset can cause random bugs or unexpected behaviour
#
# #Detach a Dataset
  detach(mydata_xl)
  detach(mydata)
}

1.13 Package “psych”

  • pairs.panels() -
    • It shows a scatter plot of matrices, with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal.
      • Calculation time is highly dependent on dataset size and type
      • See Figure 1.1
      • Conclusion: “air_time and distance are highly correlated”
ERROR 1.2 Error in plot.window(...) : need finite ’xlim’ values
  • In this case, the error will be observed if the output of pairs.panels() is assigned to an object.
  • Direct console output (i.e. no assignment) should not be a problem
ERROR 1.3 Error in par(old.par) : invalid value specified for graphical parameter "pin"
  • The error is generally observed because the plot does not have enough space in the RStudio Plots pane (lower right). In general, it is NOT a problem: internally, the plot window requires xlim[1] to be less than xlim[2], which fails when the plotting region is too small.
  • Use a larger window size or control the image size of the output

Figure 1.1 Correlation using psych::pairs.panels()

Code

# # Subset 3 Columns and 10,000 rows 
x_rows <- 10000L
data_pairs <- mydata[1:x_rows, c(15, 16, 9)]
#
# #Equivalent
ii <- mydata %>%
  select(air_time, distance, arr_delay) %>%
  slice_head(n = x_rows)
#
stopifnot(identical(ii, data_pairs))
#
if( nrow(data_pairs) * ncol(data_pairs) > 1000000 ) {
  print("Please reduce the number of points to a sane number!")
  ggplot()
} else {
  #B09P01
  pairs.panels(data_pairs)
  if(FALSE){# Cleaner Graph
    pairs.panels(data_pairs, smooth = FALSE, jiggle = FALSE, rug = FALSE, ellipses = FALSE, 
                 cex.cor = 1, cex = 1, gap = 0, main = "Title")
    title(sub = "Caption", line = 4, adj = 1)   
  }
}

1.14 Operators in R

  • There are multiple infix binary operators
  • a %in% b : returns a logical vector indicating, for each element of the left operand, whether it has a match in the right operand
  • %/% and %% perform integer division and modulo division respectively
  • %o% gives the outer product of arrays.
  • %*% performs matrix multiplication.
  • %x% computes the Kronecker product of arrays.
  • The magrittr pipe is given by %>%
  • There are other pipes in other packages as well (and base R has a native pipe |> since version 4.1.0), but this is a general summary.
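A quick sketch of these operators on toy values (chosen only for illustration):

```r
# #%in% : elementwise membership
c(1, 5, 7) %in% c(1, 5, 9)
## [1]  TRUE  TRUE FALSE
# #Integer division and modulo
c(7 %/% 3, 7 %% 3)
## [1] 2 1
# #Outer product (2 x 3 matrix)
c(1, 2) %o% c(1, 2, 3)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    2    4    6
# #Matrix multiplication
A <- matrix(1:4, nrow = 2)
A %*% A
##      [,1] [,2]
## [1,]    7   15
## [2,]   10   22
```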

Validation


2 R Introduction (B10, Sep-05)

2.2 Notebooks

These allow you to combine executable code and rich text in a single document, along with images, HTML, LaTeX and more.

Definition 2.1 R Markdown is a file format for making dynamic documents with R.

An R Markdown document is written in markdown (an easy-to-write plain text format) and contains chunks of embedded R code. To know more, go to RStudio

To know more about Google Colab, go to Google Colab

NOTE: As I am not using Google Colab, the workflow explained between 00:00 and 35:10 is NOT covered here. If someone is using Google Colab and is willing to share their notes, I would include those.

2.3 Plot

Base R graphs/plots, as shown in Figure 2.1

ERROR 2.1 Error in plot.window(...) : need finite ’xlim’ values
  • If this error is coming when base R plot() function is called
  • Check if data has NA values or if character data is supplied where numerical is needed
  • Also, do not use assignment to save base R plot and then print
Figure 2.1 Flights: Arrival Time (Y) vs. Departure (X) Time
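A minimal sketch of the base R call behind Figure 2.1, using a small synthetic stand-in for the flights columns so the snippet is self-contained:

```r
# #Synthetic stand-in for a few flights rows (values are made up)
dd <- data.frame(dep_time = c(517, 533, 542, 544, 554),
                 arr_time = c(830, 850, 923, 1004, 812))
# #Base R scatter plot via the formula interface 'y ~ x'
plot(arr_time ~ dep_time, data = dd,
     xlab = "Departure Time", ylab = "Arrival Time",
     main = "Flights: Arrival Time (Y) vs. Departure (X) Time")
```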

2.4 Dataset

  • Use cbind() or rbind() to merge dataframes

Dimensions

# #Create a Subset of Dataframe of 1000 Rows for quick calculations
bb <- head(mydata, 1000)
#
# #Dimensions: dim() Row x Column; nrow(); ncol()
dim(bb)
## [1] 1000   19
#
stopifnot(identical(nrow(bb), dim(bb)[1]))
stopifnot(identical(ncol(bb), dim(bb)[2]))

Split

# #Split a Dataframe by subsetting
data_1 <- bb[ , 1:8]
data_2 <- bb[ , 9:19]
# str(data_1)

Merge

# #Merge a Dataframe by cbind()
data_3 <- cbind(data_1, data_2)
# #Equivalent
data_4 <- data.frame(data_1, data_2)
# str(data_3)
stopifnot(identical(data_3, data_4))

RowSplit

# #Row Split
data_5 <- bb[1:300, ]
data_6 <- bb[301:1000, ]
#
# #Equivalent
n_rows <- 300L
data_5 <- bb[1:n_rows, ]
data_6 <- bb[(n_rows + 1L):nrow(bb), ]
#
stopifnot(identical(data_5, head(bb, n_rows)))
stopifnot(identical(data_6, tail(bb, (nrow(bb) - n_rows))))

RowMerge

# #Merge a Dataframe by rbind()
data_7 <- rbind(data_5, data_6)
stopifnot(identical(bb, data_7))

2.5 Change Column Headers

# #Change A Specific Name based on Index Ex: First Header "year" -> "YEAR"
# #NOTE: Output of 'names(bb)' is a character vector, not a dataframe
# #So, [1] is being used to subset for 1st element and NOT the [ , 1] (as done for dataframe)
(names(bb)[1] <- "YEAR")
## [1] "YEAR"
#
# #Change all Column Headers to Uppercase by toupper() or Lowercase by tolower()
names(bb) <- toupper(names(bb))

2.6 NA

Definition 2.2 NA is a logical constant of length 1 which contains a missing value indicator.

NA can be coerced to any other vector type except raw. There are also constants like NA_integer_, NA_real_ etc. For checking only the presence of NA, anyNA() is faster than is.na()

Overview of ‘Not Available’

  • If the imported data has blank cell, it would be imported as NA

To remove all NA

  • na.omit()
    • Output is a dataframe
    • It is slower but adds the omitted row numbers as an attribute i.e. na.action
  • complete.cases()
    • Output is a logical vector, thus it needs subsetting to get the dataframe
    • Faster and also allows partial selection of columns i.e. ignore NA in other columns
    • Caution: It may throw Error if ‘POSIXlt’ Columns are present
  • tidyr::drop_na()
  • rowSums(is.na())
    • It can also be used for excluding rows with more than an allowed number of NA. However, in general, this is not recommended because random columns retain NA, which may break the code later or change the number of observations. It is useful when all columns are similar in nature, e.g. if each column represents a response to a survey question.

NA

bb <- xxflights
# #anyNA() is faster than is.na()
if(anyNA(bb)) print("NA are Present!") else print("NA not found")
## [1] "NA are Present!"
#
# #Columnwise NA Count
bb_na_col <- colSums(is.na(bb))
#
# #Vector of Columns having NA
which(bb_na_col != 0)
##  dep_time dep_delay  arr_time arr_delay   tailnum  air_time 
##         4         6         7         9        12        15
stopifnot(identical(which(bb_na_col != 0), which(vapply(bb, anyNA, logical(1)))))
#
# #Indices of Rows with NA
head(which(!complete.cases(bb)))
## [1] 472 478 616 644 726 734
#
# #How many rows contain NA
sum(!complete.cases(bb))
## [1] 9430
#
# #How many rows have NA in specific Columns
sum(!complete.cases(bb[, c(6, 9, 4)]))
## [1] 9430

RemoveNA

# #Remove all rows which have any NA 
# #na.omit(), complete.cases(), tidyr::drop_na(), rowSums(is.na())
bb_1 <- na.omit(bb)
# #Print the Count of removed rows containing NA
print(paste0("Note: ", length(attributes(bb_1)$na.action), " rows removed."))
## [1] "Note: 9430 rows removed."
#
# #Remove additional Attribute added by na.omit()
attr(bb_1, "na.action") <- NULL
#
# #Equivalent 
bb_2 <- bb[complete.cases(bb), ]
bb_3 <- bb %>% drop_na()
bb_4 <- bb[rowSums(is.na(bb)) == 0, ]
#Validation
stopifnot(all(identical(bb_1, bb_2), identical(bb_1, bb_3), identical(bb_1, bb_4)))
#
# #complete.cases also allow partial selection of specific columns
# #Remove rows which have NA in some columns i.e. ignore NA in other columns
dim(bb[complete.cases(bb[ , c(6, 9, 4)]), ])
## [1] 327346     19
# #Equivalent 
dim(bb %>% drop_na(dep_delay, arr_delay, dep_time))
## [1] 327346     19
#
# #Remove rows which have more than allowed number of NA (ex:4) in any column
# #Caution: In general, this is not recommended because random columns retain NA
dim(bb[rowSums(is.na(bb)) <= 4L, ])
## [1] 328521     19

2.7 Apply

Sources: (SO) Grouping Functions and the Apply Family, (SO) Why is vapply safer than sapply, Hadley - Advanced R - Functionals, This, This, & This

Apply functions in R are designed to avoid explicit use of loop constructs.

  • To manipulate slices of data in a repetitive way.
  • They act on an input list, matrix or array, and apply a named function with one or several optional arguments.
  1. apply(X, MARGIN, FUN, ..., simplify = TRUE)
    • Refer R Manual p72 - “Apply Functions Over Array Margins”
    • Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
    • When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first
    • MARGIN = 1 indicates application over ROWS, 2 indicates COLUMNS
    • Examples & Details: “ForLater”
  2. lapply(X, FUN, ...)
    • Refer R Manual p342 - “Apply a Function over a List or Vector”
    • ‘list’ apply i.e. lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
    • Examples & Details: “ForLater”
    • When you want to apply a function to each element of a list in turn and get a list back.
    • lapply(x, mean)
    • lapply(x, function(x) c(mean(x), sd(x)))
  3. sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
    • ‘simplified’ wrapper of lapply
    • When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
    • Caution: It sometimes fails silently or unexpectedly changes output type
  4. vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
    • ‘verified’ apply i.e. vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.
    • vapply returns a vector or array of type matching the FUN.VALUE.
    • With FUN.VALUE you can specify the type and length of the output that should be returned each time your applied function is called.
    • It improves consistency by providing limited return type checks.
    • Further, if the input length is zero, sapply will always return an empty list, regardless of the input type (Thus behaving differently from non-zero length input). Whereas, with vapply, you are guaranteed to have a particular type of output, so you do not need to write extra checks for zero length inputs.
  5. Others - “ForLater”
    • tapply is a tagged apply where the tags identify the subsets
    • mapply for applying a function to multiple arguments
    • rapply for a ‘recursive’ version of lapply
    • eapply for applying a function to each entry in an ‘environment’
# #Subset Dataframe 
bb <- xxflights
data_8 <- bb[ , c("dep_delay", "arr_delay", "dep_time")]
#data_8 <- bb %>% select(dep_delay, arr_delay, dep_time) 
#
# #Remove missing values
data_9 <- na.omit(data_8)
#
# #Calculate Columnwise Mean
(bb_1 <- apply(data_9, 2, mean))
##   dep_delay   arr_delay    dep_time 
##   12.555156    6.895377 1348.789883
bb_2 <- unlist(lapply(data_9, mean))
bb_3 <- sapply(data_9, mean)
bb_4 <- vapply(data_9, mean, numeric(1))
#
stopifnot(all(identical(bb_1, bb_2), identical(bb_1, bb_3), identical(bb_1, bb_4)))
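The “Others” listed above can be sketched briefly on built-in data (mtcars ships with base R):

```r
# #tapply: apply a function over subsets defined by a tag (here: cylinders)
(mean_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean))
##        4        6        8 
## 26.66364 19.74286 15.10000
# #mapply: apply a function over multiple arguments in parallel
mapply(function(x, y) x + y, 1:3, 4:6)
## [1] 5 7 9
```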

2.8 Vectors

Refer The 6 Datatypes of Atomic Vectors

Create a Basic Tibble, Table2.1, for evaluating ‘is.x()’ series of functions in Base R

  • anyNA() is TRUE if there is an NA present, FALSE otherwise
  • is.atomic() is TRUE for All Atomic Vectors, factor, matrix but NOT for list
  • is.vector() is TRUE for All Atomic Vectors, list but NOT for factor, matrix, DATE & POSIXct
    • Caution: With vapply() it returns TRUE for matrix (it checks individual elements)
    • Caution: FALSE if the vector has attributes (except names) ex: DATE & POSIXct
  • is.numeric() is TRUE for both integer and double
  • is.integer(), is.double(), is.character(), is.logical() are TRUE for their respective datatypes only
  • is.factor(), is.ordered() are membership functions for factors with or without ordering
    • For more: nlevels(), levels()
  • lubridate
    • is.timepoint() is TRUE for POSIXct, POSIXlt, or Date
    • is.POSIXt(), is.Date() are TRUE for their respective datatypes only
Table 2.1: (B10T01) Vector Classes

  ii  dd  cc  ll     ff    fo    dtm                  dat
   1   1  a   FALSE  odd   odd   2022-01-25 22:28:40  2022-01-26
   2   2  b   TRUE   even  even  2022-01-25 22:28:41  2022-01-27
   3   3  c   FALSE  odd   odd   2022-01-25 22:28:42  2022-01-28
   4   4  d   TRUE   even  even  2022-01-25 22:28:43  2022-01-29
   5   5  e   FALSE  odd   odd   2022-01-25 22:28:44  2022-01-30
   6   6  f   TRUE   even  even  2022-01-25 22:28:45  2022-01-31

Basic Tibble

# #Basic Tibble
nn <- 6L
xxbasic10 <- tibble(ii = 1:nn, dd = seq(1, nn, 1), cc = head(letters, nn), 
             ll = (ii %% 2) == 0, ff = factor(rep(c("odd", "even"), length.out = nn)),
             fo = factor(rep(c("odd", "even"), length.out = nn), ordered = TRUE),
             dtm = Sys.time() + 1:nn, dat = Sys.Date() + 1:nn)
bb <- xxbasic10
str(bb)

is

# #Validation
# #anyNA() is TRUE if there is an NA present, FALSE otherwise
vapply(bb, anyNA, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.atomic() is TRUE for All Atomic Vectors, factor, matrix but NOT for list
vapply(bb, is.atomic, logical(1))
##   ii   dd   cc   ll   ff   fo  dtm  dat 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#
# #is.vector() is TRUE for All Atomic Vectors, list but NOT for factor, matrix, DATE & POSIXct
# #CAUTION: With vapply() it returns TRUE for matrix (it checks individual elements)
# #CAUTION: FALSE if the vector has attributes (except names) ex: DATE & POSIXct
vapply(bb, is.vector, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
##  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
#
# #is.numeric() is TRUE for both integer and double
vapply(bb, is.numeric, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
##  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.integer() is TRUE only for integer
vapply(bb, is.integer, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
##  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.double() is TRUE only for double
vapply(bb, is.double, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
#
# #is.character() is TRUE only for character
vapply(bb, is.character, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
#
# #is.logical() is TRUE only for logical
vapply(bb, is.logical, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Factor

# #Factors
# #is.factor() is TRUE only for factor (unordered or ordered)
vapply(bb, is.factor, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
#
# #is.ordered() is TRUE only for ordered factor
vapply(bb, is.ordered, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#
# #nlevels()
vapply(bb, nlevels, integer(1))
##  ii  dd  cc  ll  ff  fo dtm dat 
##   0   0   0   0   2   2   0   0
#
# #levels()
vapply(bb, function(x) !is.null(levels(x)), logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
#
# #table()
table(bb$ff)
## 
## even  odd 
##    3    3

lubridate::is

# #Package lubridate covers the missing functions for POSIXct, POSIXlt, or Date 
# #is.timepoint() is TRUE for POSIXct, POSIXlt, or Date
vapply(bb, is.timepoint, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
#
# #is.POSIXt() is TRUE only for POSIXct 
vapply(bb, is.POSIXt, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#
# #is.Date() is only TRUE for DATE 
vapply(bb, is.Date, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Duplicates

# #Which Columns have Duplicate Values
vapply(bb, function(x) anyDuplicated(x) != 0L, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE

2.9 Factors

Definition 2.3 Factors are the data objects which are used to categorize the data and store it as levels.

They can store both strings and integers. They are useful in the columns which have a limited number of unique values. Like “Male, Female” and “True, False” etc. They are useful in data analysis for statistical modelling.

Internally, a factor is essentially an integer vector of level codes, paired with a ‘levels’ attribute holding the corresponding labels, i.e. a numeric representation of a character vector.

as.factor() vs. factor()

  • as.factor() is faster than factor() when input is a factor or integer
  • as.factor retains unused or NA levels whereas factor drops them
    • levels can also be dropped using droplevels()
  • (SO) Levels vs. Labels
    • Levels are Input, Labels are Output in factor().
    • There is only ‘level’ attribute, no ‘label’ attribute.
    • In R (unlike SPSS) there is NO difference between what is stored and what is displayed. As soon as the levels (“papaya,” “banana”) are given labels (“pink,” “black”), there is NO way to get back the original levels.
    • (Aside) This misconception persists because generally we change between numerical to factor or binary character to factor. In these situations generally we know which level shows what and edit /sort them immediately.
  • Caution: It is highly advised to specify ‘levels’ explicitly when creating factors, to keep control of their ordering rather than depending on their order of occurrence in the vector.
    • Further, this helps while relabelling them, because the order is known.
    • Using the ‘forcats’ or ‘car’ packages this can (probably) also be done for the default situation, by sorting factor levels by their integer equivalents before assigning new labels to them. However, this is low priority for now.
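The level-dropping difference between factor(), as.factor() and droplevels() can be sketched on a toy vector:

```r
# #Factor with an unused level "c"
ff <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))
levels(ff)
## [1] "a" "b" "c"
# #factor() drops the unused level; as.factor() returns the factor unchanged
levels(factor(ff))
## [1] "a" "b"
levels(as.factor(ff))
## [1] "a" "b" "c"
# #droplevels() drops unused levels explicitly
levels(droplevels(ff))
## [1] "a" "b"
```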

Transformation

str(bb$ll)
##  logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
# #Coercion to Factor
bb$new <- as.factor(bb$ll)
str(bb$new)
##  Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 2
#
# #table()
table(bb$ll)
## 
## FALSE  TRUE 
##     3     3
table(bb$new)
## 
## FALSE  TRUE 
##     3     3
#
# #Levels can be Labelled differently also
str(bb$ff)
##  Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
# # 
str(factor(bb$ff, levels = c("even", "odd"), labels = c("day", "night")))
##  Factor w/ 2 levels "day","night": 2 1 2 1 2 1
str(factor(bb$ff, levels = c("odd", "even"), labels = c("day", "night")))
##  Factor w/ 2 levels "day","night": 1 2 1 2 1 2
#
# #Coercion from Factor to character, logical etc.
bb$xcc <- as.character(bb$new)
bb$xll <- as.logical(bb$new)
#
str(bb)
## tibble [6 x 11] (S3: tbl_df/tbl/data.frame)
##  $ ii : int [1:6] 1 2 3 4 5 6
##  $ dd : num [1:6] 1 2 3 4 5 6
##  $ cc : chr [1:6] "a" "b" "c" "d" ...
##  $ ll : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
##  $ ff : Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
##  $ fo : Ord.factor w/ 2 levels "even"<"odd": 2 1 2 1 2 1
##  $ dtm: POSIXct[1:6], format: "2022-01-25 22:28:40" "2022-01-25 22:28:41" "2022-01-25 22:28:42" ...
##  $ dat: Date[1:6], format: "2022-01-26" "2022-01-27" "2022-01-28" ...
##  $ new: Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 2
##  $ xcc: chr [1:6] "FALSE" "TRUE" "FALSE" "TRUE" ...
##  $ xll: logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE

Flights

bb <- xxflights
aa <- c("month", "day")
str(bb[aa])
## tibble [336,776 x 2] (S3: tbl_df/tbl/data.frame)
##  $ month: num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day  : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
# #To factor
bb$day <- as.factor(bb$day)
bb$month <- as.factor(bb$month)
# #Equivalent
#bb[aa] <- lapply(bb[aa], as.factor)
str(bb[aa])
## tibble [336,776 x 2] (S3: tbl_df/tbl/data.frame)
##  $ month: Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ day  : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...

Re-Label

# #Unordered Named Vector of Fruits with Names of Colours
# #NOTE: First letters of each colour and fruit match
ii <- c("pink" = "papaya", "black" = "banana", "orchid" = "orange", "amber" = "apple")
ii
##     pink    black   orchid    amber 
## "papaya" "banana" "orange"  "apple"
#
# #Factor Vectors (default is Alphabetical Sorting) using unname() to remove names
fruit_base <- factor(unname(ii))
# #sort()
fruit_sort <- factor(unname(sort(ii)))
# #unique() provides the values in the sequence of their appearance 
fruit_uniq <- factor(unname(ii), levels = unique(ii))
#
# #Default Levels Match even though the actual Vectors do not Match
stopifnot(identical(levels(fruit_base), levels(fruit_sort)))
fruit_base
## [1] papaya banana orange apple 
## Levels: apple banana orange papaya
fruit_sort
## [1] apple  banana orange papaya
## Levels: apple banana orange papaya
fruit_uniq
## [1] papaya banana orange apple 
## Levels: papaya banana orange apple
#
# #Relabelling: First letters should always match between Fruits and Colours
color_base <- fruit_base
color_sort <- fruit_sort
color_uniq <- fruit_uniq
#
levels(color_base) <- names(ii)[match(color_base, ii)] #WRONG
levels(color_sort) <- names(ii)[match(color_sort, ii)]
levels(color_uniq) <- names(ii)[match(color_uniq, ii)]
#
# #CAUTION: This is WRONG. 
color_base #WRONG
## [1] amber  black  orchid pink  
## Levels: pink black orchid amber
#
color_sort 
## [1] amber  black  orchid pink  
## Levels: amber black orchid pink
color_uniq 
## [1] pink   black  orchid amber 
## Levels: pink black orchid amber

2.10 Lists

Definition 2.4 Lists are by far the most flexible data structure in R. They can be seen as a collection of elements without any restriction on the class, length or structure of each element.

Caution: The only thing you need to take care of, is that you do not give two elements the same name. R will NOT throw ERROR.

Definition 2.5 Data Frames are lists with restriction that all elements of a data frame are of equal length.

Due to the resulting two-dimensional structure, data frames can mimic some of the behaviour of matrices. You can select rows and do operations on rows. You cannot do that with lists, as a row is undefined there.

A Dataframe is intended to be used as a relational table. This means that elements in the same column are related to each other in the sense that they are all measures of the same metric. And, elements in the same row are related to each other in the sense that they are all measures from the same observation or measures of the same item. This is why when you look at the structure of a Dataframe, it will state the number of observations and the number of variables instead of the number of rows and columns.

Dataframes are distinct from Matrices because they can include heterogeneous data types among columns/variables. Dataframes do not permit multiple data types within a column/variable, for reasons that also follow from the relational table idea.

All this implies that you should use a data frame for any dataset that fits in that two-dimensional structure. Essentially, you use data frames for any dataset where a column coincides with a variable and a row coincides with a single observation in the broad sense of the word. For all other structures, lists are the way to go.

  • Does everything in R have (exactly one) class
    • Everything has (at least one) class. Objects can have multiple classes
    • It is mostly just the class attribute of an object. But when the class attribute is not set, the class() function makes up a class from the object ‘type’ and the ‘dim’ attribute.
    • lists and dataframes have same typeof ‘list’ but different class
  • Then what does typeof() tell us
    • It tells us the internal ‘storage mode’ of an object, i.e. how R perceives the object and interacts with it.
    • An object has one and only one mode (SO) Difference between mode and class
    • class is an attribute and thus can be defined/overridden by a user, however, mode (i.e. typeof ) cannot be
  • To define an object, what should be known about it
    • class(), typeof(), is(), attributes(), str(), inherits(), …

list

# #CAUTION: Do not Create a list with duplicate names (R will NOT throw ERROR)
bb <- list(a=1, b=2, a=3)
# # 3rd index cannot be accessed using $
bb$a
## [1] 1
identical(bb$a, bb[[1]])
## [1] TRUE
identical(bb$a, bb[[3]])
## [1] FALSE
bb[[3]]
## [1] 3

class vs. typeof

# #Create a list
bb_lst <- list( a = c(1, 2), b = c('a', 'b', 'c'))
tryCatch(
# #Trying to create varying length of variables in dataframe like in list
  bb_dft <- data.frame(a = c(1, 2), b = c('a', 'b', 'c')), 
  error = function(e) {
# #Print ERROR
    cat(paste0(e))
# #Double Arrow Assignment '<<-' to assign in parent environment
    bb_dft <<- data.frame(a = c(1, 2), b = c('a', 'b'))
    }
  )
## Error in data.frame(a = c(1, 2), b = c("a", "b", "c")): arguments imply differing number of rows: 2, 3
#
# #Both list and dataframe have same typeof()
typeof(bb_lst)
## [1] "list"
typeof(bb_dft)
## [1] "list"
#
# #But, class() is different for list and dataframe
class(bb_lst)
## [1] "list"
class(bb_dft)
## [1] "data.frame"
#
str(bb_lst)
## List of 2
##  $ a: num [1:2] 1 2
##  $ b: chr [1:3] "a" "b" "c"
str(bb_dft)
## 'data.frame':    2 obs. of  2 variables:
##  $ a: num  1 2
##  $ b: chr  "a" "b"
#
# #Although 'bb_lst_c' is a list but inside coercion takes place i.e. '9' is character
bb_lst_c <- list( a = c(8, 'x'), b = c('y', 9))
str(bb_lst_c[[2]][2])
##  chr "9"
#
# #Here, '9' is numeric, it is stored as list element so note the extra [[]]
bb_lst_l <- list( a = list(8, 'x'), b = list('y', 9))
str(bb_lst_l[[2]][[2]])
##  num 9

2.11 Matrix

# #Create a Matrix
bb_mat <- matrix(1:6, nrow = 2, ncol = 3)
print(bb_mat)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
str(bb_mat)
##  int [1:2, 1:3] 1 2 3 4 5 6
class(bb_mat)
## [1] "matrix" "array"
typeof(bb_mat)
## [1] "integer"
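By default matrix() fills column-wise; byrow = TRUE fills row-wise, which is a common source of confusion:

```r
# #matrix() fills column-wise by default; byrow = TRUE fills row-wise
bb_row <- matrix(1:6, nrow = 2, byrow = TRUE)
print(bb_row)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
#
# #Equivalent: transpose of the default column-wise fill
stopifnot(identical(bb_row, t(matrix(1:6, ncol = 2))))
```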

2.12 Merge

# #Basic Tibble
bb <- xxbasic10
str(bb)
## tibble [6 x 8] (S3: tbl_df/tbl/data.frame)
##  $ ii : int [1:6] 1 2 3 4 5 6
##  $ dd : num [1:6] 1 2 3 4 5 6
##  $ cc : chr [1:6] "a" "b" "c" "d" ...
##  $ ll : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
##  $ ff : Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
##  $ fo : Ord.factor w/ 2 levels "even"<"odd": 2 1 2 1 2 1
##  $ dtm: POSIXct[1:6], format: "2022-01-25 22:28:40" "2022-01-25 22:28:41" "2022-01-25 22:28:42" ...
##  $ dat: Date[1:6], format: "2022-01-26" "2022-01-27" "2022-01-28" ...
# #Split with 'cc' as common ID column
bb_a <- bb[1:3]
bb_b <- bb[3:ncol(bb)]
#
# #merge() using the common ID column 'cc'
bb_new <- merge(bb_a, bb_b, by = "cc")
bb_new
##   cc ii dd    ll   ff   fo                 dtm        dat
## 1  a  1  1 FALSE  odd  odd 2022-01-25 22:28:40 2022-01-26
## 2  b  2  2  TRUE even even 2022-01-25 22:28:41 2022-01-27
## 3  c  3  3 FALSE  odd  odd 2022-01-25 22:28:42 2022-01-28
## 4  d  4  4  TRUE even even 2022-01-25 22:28:43 2022-01-29
## 5  e  5  5 FALSE  odd  odd 2022-01-25 22:28:44 2022-01-30
## 6  f  6  6  TRUE even even 2022-01-25 22:28:45 2022-01-31
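merge() defaults to an inner join; the all, all.x, and all.y arguments give full, left, and right joins. A sketch on toy dataframes (names chosen only for illustration):

```r
# #Toy dataframes to show merge() join types ('id' is the common key)
df_a <- data.frame(id = c("a", "b", "c"), x = 1:3)
df_b <- data.frame(id = c("b", "c", "d"), y = 4:6)
# #Inner join (default): only ids present in both (2 rows)
merge(df_a, df_b, by = "id")
# #Left join: all rows of df_a, NA where df_b has no match (3 rows)
merge(df_a, df_b, by = "id", all.x = TRUE)
# #Full join: all rows from both (4 rows)
merge(df_a, df_b, by = "id", all = TRUE)
```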

2.13 Sort

  • sort()
    • It sorts the vector in an ascending order
  • rank()
    • rank() returns the rank each element would have in an ascending sort
    • The smallest number receives rank 1
    • If there are ties, it returns numeric (not integer) ranks, with tied elements receiving the average rank e.g. 2.5
  • order()
    • order() returns the index each element would have in an ascending list, i.e. x[order(x)] equals sort(x)
  • dplyr::arrange()
    • arrange() orders the rows of a data frame by the values of selected columns.
    • NA are always sorted to the end, even when wrapped with desc().
ERROR 2.2 Error in arrange(bb, day) : could not find function "arrange"
  • Load the Package (dplyr etc.) having the function.
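The contrast between sort(), order() and rank() is easiest to see on a small vector with a tie:

```r
# #Small vector with a tie to contrast sort(), order() and rank()
vv <- c(30, 10, 20, 10)
sort(vv)
## [1] 10 10 20 30
# #order() returns indices, so vv[order(vv)] equals sort(vv)
order(vv)
## [1] 2 4 3 1
# #rank() averages ranks for ties, hence numeric (not integer) output
rank(vv)
## [1] 4.0 1.5 3.0 1.5
stopifnot(identical(sort(vv), vv[order(vv)]))
```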

order

bb <- xxflights
# #Sort ascending (default)
bb_1 <- bb[order(bb$dep_delay), ]
# #Sort descending
bb_2 <- bb[order(-bb$dep_delay), ]
#
bb[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
##   dep_time dep_delay tailnum carrier
##      <dbl>     <dbl> <chr>   <chr>  
## 1      517         2 N14228  UA     
## 2      533         4 N24211  UA     
## 3      542         2 N619AA  AA     
## 4      544        -1 N804JB  B6     
## 5      554        -6 N668DN  DL
bb_1[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
##   dep_time dep_delay tailnum carrier
##      <dbl>     <dbl> <chr>   <chr>  
## 1     2040       -43 N592JB  B6     
## 2     2022       -33 N612DL  DL     
## 3     1408       -32 N825AS  EV     
## 4     1900       -30 N934DL  DL     
## 5     1703       -27 N208FR  F9
bb_2[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
##   dep_time dep_delay tailnum carrier
##      <dbl>     <dbl> <chr>   <chr>  
## 1      641      1301 N384HA  HA     
## 2     1432      1137 N504MQ  MQ     
## 3     1121      1126 N517MQ  MQ     
## 4     1139      1014 N338AA  AA     
## 5      845      1005 N665MQ  MQ

Multi Column

bb <- xxbasic10
bb
## # A tibble: 6 x 8
##      ii    dd cc    ll    ff    fo    dtm                 dat       
##   <int> <dbl> <chr> <lgl> <fct> <ord> <dttm>              <date>    
## 1     1     1 a     FALSE odd   odd   2022-01-25 22:28:40 2022-01-26
## 2     2     2 b     TRUE  even  even  2022-01-25 22:28:41 2022-01-27
## 3     3     3 c     FALSE odd   odd   2022-01-25 22:28:42 2022-01-28
## 4     4     4 d     TRUE  even  even  2022-01-25 22:28:43 2022-01-29
## 5     5     5 e     FALSE odd   odd   2022-01-25 22:28:44 2022-01-30
## 6     6     6 f     TRUE  even  even  2022-01-25 22:28:45 2022-01-31
# #Sort ascending (default)
(bb_1 <- bb[order(bb$ll), ])
## # A tibble: 6 x 8
##      ii    dd cc    ll    ff    fo    dtm                 dat       
##   <int> <dbl> <chr> <lgl> <fct> <ord> <dttm>              <date>    
## 1     1     1 a     FALSE odd   odd   2022-01-25 22:28:40 2022-01-26
## 2     3     3 c     FALSE odd   odd   2022-01-25 22:28:42 2022-01-28
## 3     5     5 e     FALSE odd   odd   2022-01-25 22:28:44 2022-01-30
## 4     2     2 b     TRUE  even  even  2022-01-25 22:28:41 2022-01-27
## 5     4     4 d     TRUE  even  even  2022-01-25 22:28:43 2022-01-29
## 6     6     6 f     TRUE  even  even  2022-01-25 22:28:45 2022-01-31
# #Sort on Multiple Columns with ascending and descending
(bb_2 <- bb[order(bb$ll, -bb$dd), ])
## # A tibble: 6 x 8
##      ii    dd cc    ll    ff    fo    dtm                 dat       
##   <int> <dbl> <chr> <lgl> <fct> <ord> <dttm>              <date>    
## 1     5     5 e     FALSE odd   odd   2022-01-25 22:28:44 2022-01-30
## 2     3     3 c     FALSE odd   odd   2022-01-25 22:28:42 2022-01-28
## 3     1     1 a     FALSE odd   odd   2022-01-25 22:28:40 2022-01-26
## 4     6     6 f     TRUE  even  even  2022-01-25 22:28:45 2022-01-31
## 5     4     4 d     TRUE  even  even  2022-01-25 22:28:43 2022-01-29
## 6     2     2 b     TRUE  even  even  2022-01-25 22:28:41 2022-01-27
#
stopifnot(identical(bb_2, arrange(bb, ll, -dd)))

Validation


3 Data Manipulation (B11, Sep-12)

3.1 Overview

3.2 Get Help

# #To get the Help files on any Topic including 'loaded' Packages
?dplyr
?mutate
# #To get the Help files on any Topic including functions from 'not loaded' Packages
?dplyr::mutate()
# #Operators need backticks i.e. ` . On most keyboards it is located below the 'Esc' key
?`:`

3.3 Logical Operators and Functions

  • "|"   (Or, binary, vectorized)
  • "||"  (Or, binary, not vectorized)
  • "&"   (And, binary, vectorized)
  • "&&"  (And, binary, not vectorized)
  • Functions - any(), all()

Overview

  • Vectorised forms are “&” “|”
    • Thus, these compare vectors elementwise and operate over complete vector length.
    • NA is a valid logical object. Where a component of x or y is NA, the result will be NA if the outcome is ambiguous.
    • All components of x or y are evaluated
    • (recycling) of elements occur if vector lengths are different
    • These are NOT recommended for use inside if() clauses
    • These are generally used for filtering
    • &, | do the elementwise (pairwise) operation in R (whereas in Python, C etc. these symbols are bitwise operators)
  • Non-vectorised forms are “&&” “||”
    • These examine only the first element of each vector
      • Caution: For these, vector length should always be 1 (since R 4.3.0, inputs of length greater than one are an error)
      • Use all() and any() to reduce the length to one
    • (short-circuit) These stop execution as soon as these find at least one definite condition i.e. TRUE for ||, FALSE for &&.
      • They will not evaluate the second operand if the first operand is enough to determine the value of the expression.
    • These are preferred in if() clauses
    • &&, || are scalar, short-circuit operators in R (comparable to 'and', 'or' in Python and &&, || in C); they are NOT bitwise
  • all() and any()
    • all() : Are All Values TRUE
      • TRUE for 0-length vector
    • any() : Is at least one of the values TRUE
      • FALSE for 0-length vector
    • The value is a logical vector of length one being TRUE, FALSE, or NA.
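
The recycling behaviour mentioned above can be verified directly; here the shorter vector is recycled to the length of the longer one.

```r
# #Recycling: the length-2 vector is repeated to match length 4
c(TRUE, TRUE, FALSE, FALSE) & c(TRUE, FALSE)
## [1]  TRUE FALSE FALSE FALSE
stopifnot(identical(c(TRUE, TRUE, FALSE, FALSE) & c(TRUE, FALSE),
                    c(TRUE, FALSE, FALSE, FALSE)))
```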

Operators

# #At least one TRUE is present
NA | TRUE
## [1] TRUE
# #Depending upon what the unknown is, the outcome will change
NA | FALSE
## [1] NA
# #Depending upon what the unknown is, the outcome will change
NA & TRUE
## [1] NA
# #At least one FALSE is present
NA & FALSE 
## [1] FALSE
#
# #For length 1 vectors, output of vectorised and non-vectorised forms is same
stopifnot(all(identical(NA || TRUE, NA | TRUE), identical(NA || FALSE, NA | FALSE),
              identical(NA && TRUE, NA & TRUE), identical(NA && FALSE, NA & FALSE)))
#
# #But for vectors of >1 length, output is different
x <- 1:5
y <- 5:1
(x > 2) & (y < 3)
## [1] FALSE FALSE FALSE  TRUE  TRUE
(x > 2) && (y < 3)
## [1] FALSE
#
# # '&&' evaluates only the first element of Vector, thus caution is advised
TRUE & c(TRUE, FALSE)
## [1]  TRUE FALSE
TRUE & c(FALSE, FALSE)
## [1] FALSE FALSE
TRUE && c(TRUE, FALSE)
## [1] TRUE
TRUE && c(FALSE, FALSE)
## [1] FALSE
TRUE && all(c(TRUE, FALSE))
## [1] FALSE
TRUE && any(c(TRUE, FALSE))
## [1] TRUE

Evaluation

if(exists("x")) rm(x)
exists("x")
## [1] FALSE
#
# # No short-circuit for "|" or "&", Evaluates Right and throws Error
tryCatch( TRUE | x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
tryCatch( FALSE & x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
#
# #Does not evaluate Right input because outcome already determined
tryCatch( TRUE || x, error = function(e) cat(paste0(e)))
## [1] TRUE
tryCatch( FALSE && x, error = function(e) cat(paste0(e)))
## [1] FALSE
# #evaluates Right input because outcome cannot be determined and throws error
tryCatch( TRUE && x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found

AnyAll

# #any()
any(NA, TRUE)
## [1] TRUE
any(NA, FALSE)
## [1] NA
any(NA, TRUE, na.rm = TRUE)
## [1] TRUE
any(NA, FALSE, na.rm = TRUE)
## [1] FALSE
any(character(0))
## [1] FALSE
#
# #all()
all(NA, TRUE)
## [1] NA
all(NA, FALSE)
## [1] FALSE
all(NA, TRUE, na.rm = TRUE)
## [1] TRUE
all(NA, FALSE, na.rm = TRUE)
## [1] FALSE
all(character(0))
## [1] TRUE

3.4 Relational Operators

\(>\) , \(<\) , \(==\) , \(>=\) , \(<=\) , \(!=\)
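
One caution with \(==\): it tests exact equality, which is unreliable for floating-point arithmetic; all.equal() compares with a tolerance instead.

```r
# #Exact '==' fails for floating point arithmetic
0.1 + 0.2 == 0.3
## [1] FALSE
# #Compare with a tolerance instead
isTRUE(all.equal(0.1 + 0.2, 0.3))
## [1] TRUE
```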

3.5 Filter

  • dplyr::filter()
  • subset() vs. filter() -
    • Caution: R Manual itself warns against usage of subset(). It is better to use [] for subsetting
    • Caution: NOT Verified Yet
      • subset works on matrices, however, filter does not
      • subset does not work on databases, filter does
      • subset does not drop the rownames, however, filter removes them
      • filter preserves the class of the column, subset does not
      • filter works with grouped data, subset ignores them
    • filter is stricter and thus would lead to fewer causes of unexpected outcome
  • which()
    • Takes a Boolean vector and returns a shorter vector containing the indices of the elements which were true.
    • If you want to know ‘which’ elements of a logical vector are TRUE i.e. their indices.
      • Ex: Get the position of the maximum or minimum values
    • If NA are present and you do not want them in the output
  • with()
    • with() is a wrapper for functions with no ‘data’ argument. It allows usage of function as if it had a data argument.
ERROR 3.1 Error in match.arg(method) : object ’day’ not found
  • when ‘dplyr’ package is not loaded, base::filter() throws this error.
  • Either Load the Package (dplyr etc.) or use scope resolution ‘::’
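
A minimal sketch of the masking problem: without dplyr attached, filter() resolves to stats::filter(), which cannot find column names; the '::' form removes the ambiguity.

```r
# #Without dplyr attached, filter() resolves to stats::filter(), which errors
tryCatch(stats::filter(mtcars, cyl == 4),
         error = function(e) cat(conditionMessage(e), "\n"))
# #Scope resolution always picks the intended function
dim(dplyr::filter(mtcars, cyl == 4))
## [1] 11 11
```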

Basics

# #dplyr::filter() - Filter Rows based on Multiple Columns
bb_1 <- filter(bb, month == 1, day == 1)
dim(bb_1)
## [1] 842  19
# #Filtering by multiple criteria within a single logical expression
stopifnot(identical(bb_1, filter(bb, month == 1 & day == 1)))
#
if(anyNA(bb_1)) {
  bb_na <- na.omit(bb_1)
  print(paste0("Note: ", length(attributes(bb_na)$na.action), " rows removed."))
} else {
  print("NA not found")
}
## [1] "Note: 11 rows removed."
dim(bb_na)
## [1] 831  19

Conditional

dim(bb)
## [1] 336776     19
#
# #Flights in either of the months of November or December
dim(bb_2 <- filter(bb, month == 11 | month == 12))
## [1] 55403    19
#
# #Flights with arrival delay '<= 120' or departure delay '<= 120' 
# #It excludes flights where arrival & departure BOTH are delayed by >2 hours
# #If either delay is less than 2 hours, the flight is included
dim(bb_3 <- filter(bb, arr_delay <= 120 | dep_delay <= 120))
## [1] 320060     19
dim(bb_4 <- filter(bb, !(arr_delay > 120 & dep_delay > 120)))
## [1] 320060     19
dim(bb_5 <- filter(bb, (!arr_delay > 120 | !dep_delay > 120)))
## [1] 320060     19
#
# #Destination to IAH or HOU
dim(bb_6 <- filter(bb, dest == "IAH" | dest == "HOU"))
## [1] 9313   19
dim(bb_7 <- filter(bb, dest %in% c("IAH", "HOU")))
## [1] 9313   19
#
# #Carrier being "UA", "US", "DL"
dim(bb_8 <- filter(bb, carrier == "UA" | carrier == "US" | carrier == "DL"))
## [1] 127311     19
dim(bb_9 <- filter(bb, carrier %in% c("UA", "US", "DL")))
## [1] 127311     19
#
# #Did not leave late (before /on time departure) but Arrived late by >2 hours
dim(bb_10 <- filter(bb, (arr_delay > 120) & !(dep_delay > 0)))
## [1] 29 19
# 
# #Departed between midnight and 6 AM (inclusive)
dim(bb_11 <- filter(bb, (sched_dep_time >= 00 & sched_dep_time <= 600)))
## [1] 8970   19

subset()

# #subset() - Recommendation is against its usage. Use either '[]' or filter()
dim(bb_12 <- subset(bb, month == 1 | !(dep_delay >= 120), 
                    select = c("flight", "arr_delay")))
## [1] 319760      2
dim(bb_13 <- subset(bb, month == 1 | !(dep_delay >= 120) | carrier == "DL", 
                select = c("flight", "arr_delay")))
## [1] 321139      2

Filter Rows

# #Data: mtcars, 32x11, "mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb"
bb <- aa <- mtcars
#str(bb)
#summary(bb)
# #
# #Avoid subset()
ii <- subset(bb, wt > 2 & wt <= 3)
# #which() 
jj <- bb[which(bb$wt > 2 & bb$wt <= 3), ]
#
# #which() select only TRUE and NOT the NA
(1:2)[which(c(TRUE, NA))]
## [1] 1
(1:2)[c(TRUE, NA)]
## [1]  1 NA
#
# #which() is faster than head() 
ee <- bb[which(bb$wt > 2 & bb$wt <= 3)[1:6], ] 
ff <- head(bb[bb$wt > 2 & bb$wt <= 3, ], 6)
stopifnot(identical(ee, ff))
#
# #Normal Filter using [] operator
kk <- bb[bb$wt > 2 & bb$wt <= 3, ]
#
# #with()
ll <- with(bb, bb[wt > 2 & wt <= 3, ])
#
# #filter()
mm <- bb %>% filter(wt > 2 & wt <= 3)
#
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm)))
#
# #Another set of equivalent operations for OR 
ii <- subset(bb, cyl == 4 | cyl == 6)
jj <- bb[bb$cyl %in% c(4, 6), ]
kk <- bb[which(bb$cyl %in% c(4, 6)), ]
ll <- bb %>% filter(cyl == 4 | cyl == 6)
mm <- bb %>% filter(cyl %in% c(4, 6))
#
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm)))
#
# #General Conditional Subsetting on Flights data
bb <- xxflights
dim(bb)
## [1] 336776     19
#
dim(bb[which(bb$day == 1 & !(bb$month ==1)), ])
## [1] 10194    19
dim(bb[which(bb$day == 1 | bb$month ==1), ])
## [1] 37198    19
dim(bb[which(bb$day == 1 & bb$month ==1), ])
## [1] 842  19
# #Caution: which() takes ONE logical vector; a comma is NOT '&' (2nd argument is 'arr.ind' and is ignored here)
dim(bb[which(bb$day == 1, bb$month ==1), ])
## [1] 11036    19
dim(bb[which(bb$day == 1 & !(bb$carrier == "DL")), ])
## [1] 9482   19
dim(bb[which(bb$day == 1 | bb$carrier == "DL"), ])
## [1] 57592    19
dim(bb[which(bb$day == 1 & bb$carrier == "DL"), ])
## [1] 1554   19
dim(bb[which(bb$day == 1, bb$carrier == "DL"), ])
## [1] 11036    19

3.6 Subsetting

\([ \ \ ]\) , \([[ \ \ ]]\) , \(\$\)

  • Extract or Replace Parts of an Object
    • Operators acting on vectors, matrices, arrays and lists to extract or replace parts.
    • The most important distinction between “[ ],” “[[ ]]” and “$” is that the “[ ]” can select more than one element whereas the other two select a single element.
    • “$” does not allow computed indices, whereas “[[ ]]” does.
    • Subsetting (except by an empty index) will drop all attributes except names, dim and dimnames; replacement indexing (subassignment) generally keeps them.
ERROR 3.2 Error in day == 1 : comparison (1) is possible only for atomic and list types
  • It occurs when the data is not available i.e. column name is NOT found
  • It might happen when the original code assumed that the dataframe is attached
  • Either attach the dataframe (NOT Recommended) or use “$” to access column names
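
These distinctions can be checked on a small list; note that only “[[ ]]” accepts a computed (variable) index.

```r
ll <- list(alpha = 1:3, beta = letters[1:3])
nm <- "alpha"
ll[nm]        # '[' returns a list (of length 1), names kept
ll[[nm]]      # '[[' returns the element itself; a computed index works
stopifnot(identical(ll[[nm]], ll$alpha))   # '$' needs the literal name
stopifnot(is.list(ll["alpha"]), !is.list(ll[["alpha"]]))
```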

dplyr::select()

  • It can use the range operator “:”, negation “!”, and the Boolean operators “&” (and) and “|” (or)
  • Selection Helpers
    • everything(): Matches all variables.
    • last_col(): Select last variable, possibly with an offset.
  • These helpers select variables by matching patterns in their names:
    • starts_with(): Starts with a prefix.
    • ends_with(): Ends with a suffix.
    • contains(): Contains a literal string.
    • matches(): Matches a regular expression.
    • num_range(): Matches a numerical range like x01, x02, x03.
  • These helpers select variables from a character vector:
    • all_of(): Matches variable names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown.
    • any_of(): Same as all_of(), except that no error is thrown for names that do not exist.
  • This helper selects variables with a function:
    • where(): Applies a function to all variables and selects those for which the function returns TRUE.
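
A small sketch of a few helpers on a toy tibble (the column names are chosen purely for illustration; dplyr is assumed attached):

```r
library(dplyr)
dd <- tibble::tibble(x01 = 1, x02 = 2, x10 = 3, year = 4, day = 5)
names(select(dd, starts_with("x")))
## [1] "x01" "x02" "x10"
names(select(dd, num_range("x", 1:2, width = 2)))
## [1] "x01" "x02"
names(select(dd, any_of(c("year", "month"))))   # no error for missing 'month'
## [1] "year"
```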

Cols

dim(bb)
## [1] 336776     19
#
# #Subset Consecutive Columns using Colon
stopifnot(identical(bb[ , 2:5], bb[ , -c(1, 6:ncol(bb))]))
#
# #dplyr::select()
bb_14 <- select(bb, year:day, arr_delay, dep_delay, distance, air_time)
bb_15 <- bb %>% select(year:day, arr_delay, dep_delay, distance, air_time)
stopifnot(identical(bb_14, bb_15))

3.7 Grouped Summary

  • dplyr::summarise() or dplyr::summarize()
  • dplyr::group_by()
    • It converts an existing Tibble into a grouped Tibble where operations are performed “by group.”
    • ungroup() removes grouping.
    • n() gives the number of observations in the current group.

Summarise

bb <- xxflights
# #dplyr::summarise() & dplyr::summarize() are same
# #Get the mean of a column with NA excluded
#
summarize(bb, delay_mean = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
##   delay_mean
##        <dbl>
## 1       12.6
#
# #base::summary()
summary(bb$dep_delay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -43.00   -5.00   -2.00   12.64   11.00 1301.00    8255
#
# #Grouped Summary
by_ymd <- group_by(bb, year, month, day)
mysum <- summarize(by_ymd, 
                   dep_delay_mean = mean(dep_delay, na.rm = TRUE), 
                   arr_delay_mean = mean(arr_delay, na.rm = TRUE),
                   .groups = "keep")
# #Equivalent 
bb %>% 
  group_by(year, month, day) %>% 
  summarize(dep_delay_mean = mean(dep_delay, na.rm = TRUE), 
            arr_delay_mean = mean(arr_delay, na.rm = TRUE),
            .groups= "keep")
## # A tibble: 365 x 5
## # Groups:   year, month, day [365]
##     year month   day dep_delay_mean arr_delay_mean
##    <dbl> <dbl> <dbl>          <dbl>          <dbl>
##  1  2013     1     1          11.5          12.7  
##  2  2013     1     2          13.9          12.7  
##  3  2013     1     3          11.0           5.73 
##  4  2013     1     4           8.95         -1.93 
##  5  2013     1     5           5.73         -1.53 
##  6  2013     1     6           7.15          4.24 
##  7  2013     1     7           5.42         -4.95 
##  8  2013     1     8           2.55         -3.23 
##  9  2013     1     9           2.28         -0.264
## 10  2013     1    10           2.84         -5.90 
## # ... with 355 more rows

group_by()

# #Get delay grouped by distance 'Distance between airports, in miles.'
summary(bb$distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      17     502     872    1040    1389    4983
#
# #How many unique values are present in this numeric column i.e. possible factor levels
str(as.factor(bb$distance))
##  Factor w/ 214 levels "17","80","94",..: 163 165 145 171 106 96 138 22 120 99 ...
str(sort(unique(bb$distance)))
##  num [1:214] 17 80 94 96 116 143 160 169 173 184 ...
bb %>% 
  group_by(distance) %>% 
  summarize(count = n(),
            dep_delay_mean = mean(dep_delay, na.rm = TRUE), 
            arr_delay_mean = mean(arr_delay, na.rm = TRUE),
            .groups= "keep")
## # A tibble: 214 x 4
## # Groups:   distance [214]
##    distance count dep_delay_mean arr_delay_mean
##       <dbl> <int>          <dbl>          <dbl>
##  1       17     1         NaN           NaN    
##  2       80    49          18.9          16.5  
##  3       94   976          17.5          12.7  
##  4       96   607           3.19          5.78 
##  5      116   443          17.7           7.05 
##  6      143   439          23.6          14.4  
##  7      160   376          21.8          16.2  
##  8      169   545          18.5          15.1  
##  9      173   221           7.05         -0.286
## 10      184  5504           3.07          0.123
## # ... with 204 more rows
#
# #For distance = 17, there is only 1 flight, and its delays are NA, so the mean is NaN
bb[bb$distance == 17, ]
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##   <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>     <dbl> <chr>  
## 1  2013     7    27       NA            106        NA       NA            245        NA US     
## # ... with 9 more variables: flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#
# #In general, all flights to a given destination (ex: ABQ) have travelled the same distance (1826 miles)
unique(bb %>% filter(dest == "ABQ") %>% select(distance))
## # A tibble: 1 x 1
##   distance
##      <dbl>
## 1     1826
#
# #Mean Delays for Destinations with more than 1000 miles distance
bb %>% 
  group_by(dest) %>% 
  filter(distance > 1000) %>% 
  summarize(count = n(), 
            distance_mean = mean(distance, na.rm = TRUE),
            dep_delay_mean = mean(dep_delay, na.rm = TRUE), 
            arr_delay_mean = mean(arr_delay, na.rm = TRUE))
## # A tibble: 48 x 5
##    dest  count distance_mean dep_delay_mean arr_delay_mean
##    <chr> <int>         <dbl>          <dbl>          <dbl>
##  1 ABQ     254         1826           13.7           4.38 
##  2 ANC       8         3370           12.9          -2.5  
##  3 AUS    2439         1514.          13.0           6.02 
##  4 BQN     896         1579.          12.4           8.25 
##  5 BUR     371         2465           13.5           8.18 
##  6 BZN      36         1882           11.5           7.6  
##  7 DEN    7266         1615.          15.2           8.61 
##  8 DFW    8738         1383.           8.68          0.322
##  9 DSM     569         1021.          26.2          19.0  
## 10 EGE     213         1736.          15.5           6.30 
## # ... with 38 more rows

3.8 Mutate

  • dplyr::mutate()
    • Newly created variables are available immediately
    • New variables overwrite existing variables of the same name.
    • Variables can be removed by setting their value to NULL.
    • mutate() adds new variables and preserves existing ones
      • mutate() can also keep or drop column according to the .keep argument.
    • transmute() adds new variables and drops existing ones.
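
The bullet points above can be sketched on a toy tibble (dplyr assumed attached):

```r
library(dplyr)
dd <- tibble::tibble(a = 1:3, b = 4:6, cc = 7:9)
# #'s' is available immediately for 'double_s'; 'a' is removed via NULL
mutate(dd, s = a + b, double_s = 2 * s, a = NULL)
# #.keep = "used" retains only the columns referenced plus the new ones
names(mutate(dd, s = a + b, .keep = "used"))
## [1] "a" "b" "s"
# #transmute() keeps only the new variables
names(transmute(dd, s = a + b))
## [1] "s"
```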
ERROR 3.3 Error in UseMethod("select") : no applicable method for ’select’ applied to an object of class "function"
  • Run ‘str(MyObject)’ to check if ‘MyObject’ exists, looks as expected and R is not finding something else.
  • Most probably the name ‘data’ was used where an actual dataset was intended, so R found the function utils::data() instead.
  • To minimise this type of Error, do not use names which match Base R functions e.g. ‘data’ (function in utils) or ‘df’ (function in stats)
ERROR 3.4 Error: Problem with mutate() column ... column object ’arr_delay’ not found
  • Run ‘str(MyObject)’ to check if the column exists in the dataset
  • Caution: if the dataset was attached earlier, then R will NOT throw this error. However, later when the code is being executed in a clean environment, it will fail. To avoid this, it is recommended to use proper scope resolution and to avoid attaching the dataset (if possible)
dim(bb)
## [1] 336776     19
#
bb_16 <- select(bb, year:day, arr_delay, dep_delay, distance, air_time)
bb_17 <- mutate(bb_16,
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60,
       hours = air_time / 60,
       gain_per_hour = gain / hours)
# #Equivalent
bb %>% 
  select(year:day, arr_delay, dep_delay, distance, air_time) %>% 
  mutate(gain = arr_delay - dep_delay,
         speed = distance / air_time * 60,
         hours = air_time / 60,
         gain_per_hour = gain / hours)
## # A tibble: 336,776 x 11
##     year month   day arr_delay dep_delay distance air_time  gain speed hours gain_per_hour
##    <dbl> <dbl> <dbl>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl>         <dbl>
##  1  2013     1     1        11         2     1400      227     9  370. 3.78           2.38
##  2  2013     1     1        20         4     1416      227    16  374. 3.78           4.23
##  3  2013     1     1        33         2     1089      160    31  408. 2.67          11.6 
##  4  2013     1     1       -18        -1     1576      183   -17  517. 3.05          -5.57
##  5  2013     1     1       -25        -6      762      116   -19  394. 1.93          -9.83
##  6  2013     1     1        12        -4      719      150    16  288. 2.5            6.4 
##  7  2013     1     1        19        -5     1065      158    24  404. 2.63           9.11
##  8  2013     1     1       -14        -3      229       53   -11  259. 0.883        -12.5 
##  9  2013     1     1        -8        -3      944      140    -5  405. 2.33          -2.14
## 10  2013     1     1         8        -2      733      138    10  319. 2.3            4.35
## # ... with 336,766 more rows

Validation


4 Statistics (B12, Sep-26)

4.2 Definitions

16.20 A population is the set of all elements of interest in a particular study.

16.23 The process of conducting a survey to collect data for the entire population is called a census.

16.21 A sample is a subset of the population.

22.7 A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied. Each element selected comes from the same population. Each element is selected independently. The second condition prevents selection bias.

4.3 Inferential Statistics

16.25 Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.

Inferential statistics are used for Hypothesis Testing. Refer Statistical Inference

4.4 Hypothesis Testing

Refer Hypothesis Testing

24.1 Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.

24.2 Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.

24.3 Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.

Refer Steps of Hypothesis Testing

  1. State the NULL Hypothesis \({H_0}\)
    • The null will always be in the form of decisions regarding the population, not the sample.
      • If we have population data, we can do the census and then there is no requirement of any hypothesis or estimation.
    • The Null Hypothesis will always be written as the absence of some parameter or process characteristic
      • The test is designed to assess the strength of the evidence against the null hypothesis.
      • Often the null hypothesis is a statement of “no difference.”
    • Equality part of expression always appears in \({H_0}\) i.e. it can be \(>=\) , \(<=\) , \(==\)
    • The term ‘null’ is used because this hypothesis assumes that there is no difference between the two means or that the recorded difference is not significant.
  2. An Alternative Hypothesis \({H_a}\), is then stated which will be the complement of the Null Hypothesis.
    • \({H_a}\) cannot have equality part of expression i.e. it can be \(<\) , \(>\) , \(!=\)
    • The claim about the population that evidence is being sought for is the alternative hypothesis
      • However, to prove it true, one instead tries to prove its complement (the null hypothesis) false, because it is easier to disprove a statement than to prove it.
  3. For Hypothesis tests involving a population mean, let \({\mu}_0\) denote the hypothesized value

24.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)

24.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

24.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

  • Sample data is used to determine whether or not you can be statistically confident that you can reject or fail to reject the \({H_0}\).
    • If the \({H_0}\) is rejected, the statistical conclusion is that the \({H_a}\) is TRUE.
  • Notes:
    • Sometimes it is easier to formulate the alternative hypothesis (the conclusion that you hope to support) and create NULL hypothesis based on that.
    • Ex: If we are testing for validity of the claim that number of defects are less than 2%
      • \({H_a} : {\mu} < 2\% \iff {H_0} : {\mu} \geq 2\%\)
      • If the \({H_0}\) is rejected, then the statistical conclusion is that the \({H_a}\) is TRUE i.e. defects are less than 2% in the population
      • If the \({H_0}\) is not rejected, then no conclusion can be formed about the \({H_a}\).
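
A hedged sketch of the defects example in R: the data are simulated (an assumption, with a seed for reproducibility), and a left (lower) tail one-sample t-test is run against \({H_0} : {\mu} \geq 2\%\).

```r
# #Simulated defect proportions (illustrative assumption, not real data)
set.seed(42)
x <- rnorm(30, mean = 0.015, sd = 0.005)
# #Left (lower) tail test: H0: mu >= 0.02 vs Ha: mu < 0.02
tt <- t.test(x, mu = 0.02, alternative = "less")
tt$p.value    # reject H0 if below the chosen alpha (e.g. 0.05)
```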

Question: Is there an ideal sample size

  • NO
  • (“ForLater”) However, there exists a relationship between (I guess) alpha, beta and the sample size n. (I could not find the link on a later search.)
  • (Paraphrasing from memory, so it can be wrong!) Basically, for a given analysis, if we want to keep both types of errors to a manageable level, we can calculate the minimum number of samples that would help us in determining the outcome at a certain minimum confidence level etc.

4.5 Point Estimation

22.9 To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.

22.10 A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators of the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), \(\rho_{xy}\) (correlation)

22.11 The numerical value obtained for the sample statistic is called the point estimate. ‘Estimate’ is used for a sample value only; the corresponding population value is called a parameter. An estimate is a value, while an estimator is a function.

Example: \({\overline{x}}\) is an estimator (of the population parameter ‘mean’ \({\mu}\)). Its estimate is 3, and this calculation process is an estimation.
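
The distinction can be sketched directly: the function mean() plays the role of the estimator, and its value on a particular sample is the estimate (the population below is simulated purely for illustration).

```r
# #A pretend population (assumption) and one sample drawn from it
set.seed(3)
pop <- rnorm(1e5, mean = 3, sd = 1)
samp <- sample(pop, 50)
mean(samp)   # point estimate of the parameter mu (should be close to 3)
sd(samp)     # point estimate of the parameter sigma
```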

4.6 Standard Deviation

18.8 Given a data set \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({{x}_1, {x}_2, \ldots, {x}_n}\) divided by the count \({n}\).

Refer Standard Deviation and equation (18.12)

18.15 The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.

\[\begin{align} \sigma &= \sqrt{\frac{1}{n} \sum_{i=1}^n \left(x_i - \mu\right)^2} \\ {s} &= \sqrt{\frac{1}{n-1} \sum_{i=1}^n \left(x_i - \overline{x}\right)^2} \end{align}\]

A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.

4.7 Variance

Refer Variance and equation (18.11)

18.14 The variance \(({\sigma}^2)\) is based on the difference between the value of each observation \({x_i}\) and the mean (\({\mu}\) for a population, \({\overline{x}}\) for a sample). The average of the squared deviations is called the variance.

\[\begin{align} \sigma^2 &= \frac{1}{n} \sum _{i=1}^{n} \left(x_i - \mu \right)^2 \\ s^2 &= \frac{1}{n-1} \sum _{i=1}^{n} \left(x_i - \overline{x} \right)^2 \end{align}\]

Variability is most commonly measured with the Range, IQR, SD, and Variance.

4.8 Standard Error or Sampling Fluctuation

The sample we draw from the population is only one from a large number of potential samples.

  • If ten researchers were all studying the same population, each drawing their own sample, they may obtain different answers i.e. each of the ten researchers may come up with a different mean
  • Thus, the statistic in question (mean) varies from sample to sample. It has a distribution called a sampling distribution.
  • We can use this distribution to understand the uncertainty in our estimate of the population parameter.

Refer Standard Error

22.13 In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’

22.14 A sampling error is the difference between a population parameter and a sample statistic.

Sampling fluctuation (Standard Error) refers to the extent to which a statistic (mean, median, mode, sd etc.) takes on different values with different samples i.e. it refers to how much the value of the statistic fluctuates from sample to sample.

22.12 The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).

Standard Deviation of \({\overline{x}}\), \(\sigma_{\overline{x}}\) is given by equation (22.1) i.e. \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\)

  • Generally, the standard error is unknown.
  • The higher the standard error, the higher the deviation from sample to sample i.e. the lower the reliability.
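
The relation \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\) can be checked by simulation: draw many samples, compute each sample mean, and compare the SD of those means with the theoretical value (the population parameters below are assumptions for illustration).

```r
# #Empirical vs theoretical standard error of the mean
set.seed(7)
pop_sd <- 10; n <- 25
means <- replicate(5000, mean(rnorm(n, mean = 50, sd = pop_sd)))
c(empirical = sd(means), theoretical = pop_sd / sqrt(n))
```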

4.9 Test Statistic

Refer Test Statistic

24.11 Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.

24.14 If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (24.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)
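
Equation (24.1) as a direct calculation (the numbers below are illustrative assumptions, not from the flights data):

```r
# #z = (xbar - mu0) / (sigma / sqrt(n)) with made-up values
xbar <- 162.4; mu0 <- 160; sigma <- 8; n <- 25
(z <- (xbar - mu0) / (sigma / sqrt(n)))
## [1] 1.5
```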

4.10 Calculate SD & SE

Standard Error (SE) is the same as ‘the standard deviation of the sampling distribution’ of a statistic. For the mean, the ‘variance of the sampling distribution’ is the variance of the data divided by N.

Calculate Statistics

# #DataSet: Height of 5 people in 'cm'
hh <- c(170.5, 161, 160, 170, 150.5)
#
# #N by length()
print(hh_len <- length(hh))
## [1] 5
#
# #Mean by mean()
hh_mean <- mean(hh)
cat("Mean = ", hh_mean)
## Mean =  162.4
#
# #Variance by var()
hh_var <- round(var(hh), 3)
cat("Variance = ", hh_var)
## Variance =  68.175
#
# #Standard Deviation (SD) by sd()
hh_sd <- round(sd(hh), 3)
cat("Standard Deviation (SD) = ", hh_sd)
## Standard Deviation (SD) =  8.257
#
# #Standard Error (SE) 
hh_se_sd <- round(hh_sd / sqrt(hh_len), 3)
cat("Standard Error (SE) = ", hh_se_sd)
## Standard Error (SE) =  3.693

R Functions

# #DataSet: Height of 5 people in 'cm'
print(hh)
## [1] 170.5 161.0 160.0 170.0 150.5
#
# #N by length()
print(hh_len <- length(hh))
## [1] 5
#
# #sum by sum()
print(hh_sum <- sum(hh))
## [1] 812
#
# #Mean by mean()
hh_mean <- mean(hh)
hh_mean_cal <- hh_sum / hh_len
stopifnot(identical(hh_mean, hh_mean_cal))
cat("Mean = ", hh_mean)
## Mean =  162.4
#
# #Calculate the deviation from the mean by subtracting each value from the mean
print(hh_dev <- hh - hh_mean)
## [1]   8.1  -1.4  -2.4   7.6 -11.9
#
# #Square the deviation
print(hh_sqdev <- hh_dev^2)
## [1]  65.61   1.96   5.76  57.76 141.61
#
# #Get Sum of the squared deviations
print(hh_sqdev_sum <- sum(hh_sqdev))
## [1] 272.7
#
# #Divide it by the 'sample size (N) – 1' for the Variance or use var()
hh_var <- round(var(hh), 3)
hh_var_cal <- hh_sqdev_sum / (hh_len -1)
stopifnot(identical(hh_var, hh_var_cal))
cat("Variance = ", hh_var)
## Variance =  68.175
#
# #Variance of the sampling distribution 
hh_var_sample <- hh_var / hh_len
cat("Variance of the Sampling Distribution = ", hh_var_sample)
## Variance of the Sampling Distribution =  13.635
#
# #Take square root of the Variance for the Standard Deviation (SD) or use sd()
hh_sd_cal <- round(sqrt(hh_var), 3)
hh_sd <- sd(hh)
stopifnot(identical(round(hh_sd, 3), hh_sd_cal))
cat("Standard Deviation (SD) = ", hh_sd)
## Standard Deviation (SD) =  8.256815
#
# #Standard Error (SE)
# #SE
# #Divide the SD by the square root of the sample size for the Standard Error (SE)
# #
hh_se_sd <- round(hh_sd / sqrt(hh_len), 3)
#
# #Calculate SE from Variance 
hh_se_var <- round(sqrt(hh_var_sample), 3)
stopifnot(identical(hh_se_sd, hh_se_var))
cat("Standard Error (SE) = ", hh_se_sd)
## Standard Error (SE) =  3.693

4.11 Histogram and Density

Using Dataset Flights: “air_time” - Amount of time spent in the air, in minutes. Refer figure 4.1

Graphs


Figure 4.1 Flights: Air Time (min) excluding NA (Histogram and Density)

NA

# #Remove All NA
aa <- na.omit(xxflights$air_time)
attr(aa, "na.action") <- NULL
str(aa)
##  num [1:327346] 227 227 160 183 116 150 158 53 140 138 ...
summary(aa)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    82.0   129.0   150.7   192.0   695.0

Stats

# #Overview of Data after removal of NA
bb <- aa
stopifnot(is.null(dim(bb)))
summary(bb)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    82.0   129.0   150.7   192.0   695.0
# #min(), max(), range(), summary()
min_bb <- summary(bb)[1]
max_bb <- summary(bb)[6]
range_bb <- max_bb - min_bb
cat(paste0("Range = ", range_bb, " (", min_bb, ", ", max_bb, ")\n"))
## Range = 675 (20, 695)
# #IQR(), summary()
iqr_bb <- IQR(bb)
cat(paste0("IQR = ", iqr_bb, " (", summary(bb)[2], ", ", summary(bb)[5], ")\n"))
## IQR = 110 (82, 192)
# #median(), mean(), summary()[3], summary()[4] 
median_bb <- median(bb)
cat("Median =", median_bb, "\n")
## Median = 129
mu_mean_bb <- mean(bb)
cat("Mean \u03bc =", mu_mean_bb, "\n")
## Mean µ = 150.6865
#
sigma_sd_bb <- sd(bb)
cat("SD (sigma) \u03c3 =", sigma_sd_bb, "\n")
## SD (sigma) σ = 93.6883
#
variance_bb <- var(bb)
cat(sprintf('Variance (sigma)%s %s%s =', '\u00b2', '\u03c3', '\u00b2'), variance_bb, "\n")
## Variance (sigma)² σ² = 8777.498

Histogram

# #Histogram
bb <- na.omit(xxflights$air_time)
hh <- tibble(ee = bb)
# #Basics
median_hh <- round(median(hh[[1]]), 1)
mean_hh <- round(mean(hh[[1]]), 1)
sd_hh <- round(sd(hh[[1]]), 1)
len_hh <- nrow(hh)
#
B12P01 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean_hh), color = '#440154FF') +
  geom_text(data = tibble(x = mean_hh, y = -Inf, 
                          label = paste0("Mean= ", mean_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
  geom_vline(aes(xintercept = median_hh), color = '#3B528BFF') +
  geom_text(data = tibble(x = median_hh, y = -Inf, 
                          label = paste0("Median= ", median_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#3B528BFF', hjust = -0.5, vjust = -0.7, angle = 90) +
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Frequency", 
       subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh, 
                         "; Median= ", median_hh, "; SD= ", sd_hh,
                         ")"), 
        caption = "B12P01", title = "Flights: Air Time")
}

Density

# #Density Curve
# #Get Quantiles and Ranges of mean +/- sigma 
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
#
# # Latex Labels 
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'), 
                TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'), 
                TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
B12P02 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_density(alpha = 0.2, colour = "#21908CFF") + 
  geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  #scale_y_continuous(limits = c(0, 0.009), breaks = seq(0, 0.009, 0.003)) +
  scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) + 
  ggplot2::annotate("segment", x = xpoints_hh[4] - 0.5 * sd_hh, xend = xpoints_hh[2], y = 0.007, 
                    yend = 0.007, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) + 
  ggplot2::annotate("segment", x = xpoints_hh[4] + 0.5 * sd_hh, xend = xpoints_hh[6], y = 0.007, 
                    yend = 0.007, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) + 
  ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.007, label = "95.4%") + 
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Density", 
       subtitle = paste0("(N=", nrow(.), "; ", "Mean= ", round(mean(.[[1]]), 1), 
                         "; Median= ", round(median(.[[1]]), 1), "; SD= ", round(sd(.[[1]]), 1),
                         ")"), 
        caption = "B12P02", title = "Flights: Air Time")
}

Aside

  • This section is NOT useful for the general reader and can be safely ignored. It contains my notes related to building this book, useful only for someone building their own book. (Shivam)
  • Side by Side Images need a Caption in Final Chunk
  • LaTeX inside TeX() cannot process braces as usual; avoid them or escape them

4.12 Effect of Sample Size and Repeat Sampling

Using Dataset Flights: “air_time” (amount of time spent in the air, in minutes).

  1. Effect of increasing sample size (N = 100, 1000, 10000), Refer figure 4.2
    • the precision of and confidence in the estimate increase, and uncertainty decreases
    • the distribution of sample means becomes narrower, i.e. the standard error decreases
  2. Effect of increasing the number of repeat samples, Refer figure 4.4
    • The mean of the distribution of sample means equals the mean of the parent distribution.
    • Refer Standard Error

Caution: The trend here does not fully match the theory. However, the exercise shows the ‘how to do it’ part. It can be repeated with better data, a larger sample size, or more repeat sampling.

4.12.1 Sample Size

GIF

Effect of Increasing Sample Size

Figure 4.2 Effect of Increasing Sample Size

Images

Effect of Increasing Sample Size

Figure 4.3 Effect of Increasing Sample Size

Code

bb <- na.omit(xxflights$air_time)
# #Pseudo Random Number Generation by set.seed() 
set.seed(3)
# #Set Sample Size
#nn <- 100L
# #Take a sample from dataset
xb100 <- sample(bb, size = 100L)
xb1000 <- sample(bb, size = 1000L)
xb10000 <- sample(bb, size = 10000L)
# #Population Mean
mu_hh <- round(mean(bb), 1)
# #Histogram: N = 100
hh <- tibble(ee = xb100)
ylim_hh <- 12.5
cap_hh <- "B12P03"
# #Assumes 'hh' has data in 'ee'. In: mu_hh, cap_hh, ylim_hh
#
B12 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
  geom_text(aes(label = TeX(r'($\bar{x}$)', output = "character"), 
                x = mean(.data[["ee"]]), y = -Inf), 
            color = '#440154FF', hjust = 2, vjust = -2.5, parse = TRUE, check_overlap = TRUE) + 
  geom_vline(aes(xintercept = mu_hh), color = '#3B528BFF') +
  geom_text(aes(label = TeX(r'($\mu$)', output = "character"), x = mu_hh, y = -Inf),
            color = '#3B528BFF', hjust = -1, vjust = -2, parse = TRUE, check_overlap = TRUE) + 
  coord_cartesian(xlim = c(0, 800), ylim = c(0, ylim_hh)) + 
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Frequency", 
       subtitle = paste0("(Mean= ", round(mean(.[[1]]), 1), 
                         "; SD= ", round(sd(.[[1]]), 1),
                         #"; Var= ", round(var(.[[1]]), 1),
                         "; SE= ", round(sd(.[[1]]) / sqrt(nrow(.)), 1),
                         ")"), 
      caption = cap_hh, title = paste0("Sample Size = ", nrow(.)))
}
assign(cap_hh, B12)
rm(B12)

Warnings

  • “In mean.default(gg) : argument is not numeric or logical: returning NA”
    • For ggplot(): this comes up if the object ‘gg’ is NULL. Check whether ggplot is looking in the global scope instead of the local data frame that was passed.

Deprecated

4.12.2 Repeat Sampling

GIF

Effect of Repeat Sampling

Figure 4.4 Effect of Repeat Sampling

Images

Effect of Increasing Sampling

Figure 4.5 Effect of Increasing Sampling

Code

bb <- na.omit(xxflights$air_time)
# #Pseudo Random Number Generation by set.seed() 
set.seed(3)
# #Set Sample Size
nn <- 10L
# #Set Repeat Sampling Rate
rr <- 20L
# #Take Sample of N = 10, get mean, repeat i.e. get distribution of mean
xr20 <- replicate(rr, mean(sample(bb, size = nn)))
rr <- 200L
xr200 <- replicate(rr, mean(sample(bb, size = nn)))
rr <- 2000L
xr2000 <- replicate(rr, mean(sample(bb, size = nn)))
#
# #Population Mean
mu_hh <- round(mean(bb), 1)
# #Histogram: N = 10, Repeat = 20
hh <- tibble(ee = xr20)
ylim_hh <- 2
cap_hh <- "B12P06"
# #Assumes 'hh' has data in 'ee'. In: mu_hh, cap_hh, ylim_hh, nn
#
B12 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
  geom_text(aes(label = TeX(r'($E(\bar{x})$)', output = "character"), 
                x = mean(.data[["ee"]]), y = -Inf), 
            color = '#440154FF', hjust = 1.5, vjust = -1.5, parse = TRUE, check_overlap = TRUE) + 
  geom_vline(aes(xintercept = mu_hh), color = '#3B528BFF') +
  geom_text(aes(label = TeX(r'($\mu$)', output = "character"), x = mu_hh, y = -Inf),
            color = '#3B528BFF', hjust = -1, vjust = -2, parse = TRUE, check_overlap = TRUE) + 
  coord_cartesian(xlim = c(0, 800), ylim = c(0, ylim_hh)) + 
  theme(plot.title.position = "panel") + 
  labs(x = TeX(r'($\bar{x} \, (\neq x)$)'), y = TeX(r'(Frequency of $\, \bar{x}$)'), 
       subtitle = TeX(sprintf(
         "($\\mu$=%.0f) $E(\\bar{x}) \\, =$%.0f; $\\sigma_{\\bar{x}} \\, =$%.0f",
                             mu_hh, round(mean(.[[1]]), 1), round(sd(.[[1]])))),
       caption = cap_hh, 
       title = paste0("Sampling Distribution (N = ", nn, ") & Repeat Sampling = ", nrow(.)))
}
assign(cap_hh, B12)
rm(B12)

4.13 Normal Distribution

Normal Distribution

Figure 4.6 Normal Distribution

Refer Normal Distribution and equation (21.2)

21.3 A normal distribution (\({\mathcal{N}}_{({\mu}, \, {\sigma}^2)}\)) is a type of continuous probability distribution for a real-valued random variable.

Their importance is partly due to the Central Limit Theorem. The assumption of a normal distribution allows the application of parametric methods.

33.1 Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).

22.15 Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.

It states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable—whose distribution converges to a normal distribution as the number of samples increases.
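This convergence can be sketched with a quick simulation. The parent distribution here is a hypothetical exponential with mean 1 (clearly non-normal and skewed); the means of repeated samples of size 40 should cluster near \({\mu} = 1\) with spread close to \({\sigma}/\sqrt{n}\):

```r
# #CLT sketch: skewed parent (Exponential, mu = 1, sigma = 1), n = 40 per sample
set.seed(10)
nn <- 40L
means <- replicate(5000, mean(rexp(nn, rate = 1)))
# #Mean of sample means ~ mu; SD of sample means ~ sigma/sqrt(n)
round(c(mean_of_means = mean(means), sd_of_means = sd(means),
        se_expected = 1 / sqrt(nn)), 3)
```

The simulated `sd_of_means` lands close to the theoretical `se_expected`, even though no individual observation is normal.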

Parametric statistical tests typically assume that samples come from normally distributed populations, but the central limit theorem means that this assumption is not necessary to meet when you have a large enough sample. A sample size of 30 or more is generally considered large.

This is the basis of Empirical Rule.

18.23 Empirical rule is used to compute the percentage of data values that must be within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. These probabilities are Pr(x) 68.27%, 95.45%, and 99.73%.
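These percentages can be recovered directly from the standard normal CDF with pnorm():

```r
# #Empirical Rule: Pr(mu - k*sigma <= x <= mu + k*sigma) for k = 1, 2, 3
pp <- sapply(1:3, function(k) pnorm(k) - pnorm(-k))
round(100 * pp, 2)
## [1] 68.27 95.45 99.73
```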

Caution: If data from small samples do not closely follow this pattern, then other distributions like the t-distribution may be more appropriate.

4.14 Standard Normal Distribution

Refer Standard Normal and equation (21.3)

21.4 A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal{z}}_{({\mu} = 0, \, {\sigma} = 1)}\)

The simplest case of a normal distribution is known as the standard normal distribution. Given the Population with normal distribution \({\mathcal{N}}_{(\mu, \, \sigma)}\)

If \(\overline{X}\) is the mean of a sample of size \({n}\) from this population, then the standard error is \(\sigma/{\sqrt{n}}\) and thus the z-score is \(Z=\frac {\overline{X} - \mu}{\sigma/{\sqrt{n}}}\)

The z-score is the test statistic used in a z-test. The z-test is used to compare the means of two groups, or to compare the mean of a group to a set value. Its null hypothesis typically assumes no difference between groups.

The area under the curve to the right of a z-score is the p-value, and it is the likelihood of your observation occurring if the null hypothesis is true.

Usually, a p-value of 0.05 or less means that your results are unlikely to have arisen by chance; it indicates a statistically significant effect.
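A short sketch of these steps with hypothetical numbers (\(\mu = 100\), \(\sigma = 15\), \(n = 36\), \(\overline{X} = 104.5\)):

```r
# #z-score and right-tail p-value (hypothetical: mu = 100, sigma = 15, n = 36, xbar = 104.5)
mu <- 100; sigma <- 15; n <- 36; xbar <- 104.5
z <- (xbar - mu) / (sigma / sqrt(n))
p <- pnorm(z, lower.tail = FALSE)
# z is 1.8; p is about 0.036, below the usual alpha = 0.05
round(c(z = z, p = p), 4)
```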

4.15 Outliers

Refer Outliers: C03

18.24 Sometimes unusually large or unusually small values are called outliers. It is a data point that differs significantly from other observations.

  • Question: If we include a datapoint which is 4 standard deviations away, would we be able to get the Normal Distribution
    • Shape of the curve will be tilted; thus it will be difficult to keep the datapoint and satisfy the condition for normality
    • Generally, only \({{\mu} - 3{\sigma} \leq {x} \leq {\mu} + 3{\sigma}}\) values are kept and the remaining are treated as outliers
  • Question: Is it bad data if it is 4 standard deviations away
    • It means that if we keep the data point, there is a high possibility that we will violate the normality assumption. If we violate the assumption, parametric methods cannot be applied to the dataset
    • In general, convert to z-value, remove those which have z-value higher than +3 or lower than -3
  • Question: But, how many removals are too many removals
    • There are techniques for this consideration, will be covered later. “ForLater”
    • (Aside) In a sample of 1000 observations, the presence of up to five observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected. If the sample size is only 100, however, just three such outliers are already reason for concern.
  • Concern: Frequency or Proportion of outliers should not be very high
    • It is true that we cannot have normal distribution in that case.
      • However, can we afford to remove all the data points with \(z > 3\), this needs to be answered in the context of analysis.
        • Here the ‘assignable cause’ is applied. i.e. each datapoint that is proposed to be an outlier is individually analysed and either kept or removed
  • Concern: Sometimes the outliers are present because the dataset is a mixture of two distributions
    • In that case, those should be treated separately
  • Question: Are there tools for all of this jugglery
    • Yes, there are; in particular, nonparametric methods do not make any assumption about the distribution.
    • However, these are not as powerful as parametric tests, so if possible, stay with parametric tests
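The z-value screen described above can be sketched as follows, on hypothetical data with one injected extreme point (the cutoff |z| > 3 is the rule of thumb from this section):

```r
# #Outlier screen: flag |z| > 3 (hypothetical data, one injected extreme value)
set.seed(1)
x <- c(rnorm(100), 8)
z <- (x - mean(x)) / sd(x)
x_clean <- x[abs(z) <= 3]
length(x) - length(x_clean)   # number of points flagged as outliers
```

Remember the ‘assignable cause’ point above: in practice each flagged point should still be reviewed individually before removal.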

4.16 Type I and Type II Errors

Type-I $(\alpha)$ and Type-II $(\beta)$ Errors

Figure 4.7 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors

Example

  • Type-I “An innocent person is convicted”
  • Type-II “A guilty person is not convicted”

Since we are using sample data to make inferences about the population, it is possible that we will make an error. In the case of the Null Hypothesis, we can make one of two errors.

Refer Type I and Type II Errors

24.7 The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).

24.8 The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).

24.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.

23.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

24.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).

There is always a tradeoff between Type-I and Type-II errors.

  • Generally max 5% \({\alpha}\) and max 20% \({\beta}\) errors are recommended

In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting \({\alpha}\), that person is controlling the probability of making a Type I error.

  • If the cost of making a Type I error is high, small values of \({\alpha}\) are preferred. Ex: \(\alpha =0.01\)
  • If the cost of making a Type I error is not too high, larger values of \({\alpha}\) are typically used. Ex: \(\alpha = 0.05\)

24.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.

Although most applications of hypothesis testing control for the probability of making a Type I error, they do not always control for the probability of making a Type II error. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement "do not reject \({H_0}\)" instead of “accept \({H_0}\).”
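The operational meaning of \({\alpha}\) can be checked by simulation: when \({H_0}\) is true, a test at \({\alpha} = 0.05\) should reject in roughly 5% of repeated samples (a sketch with hypothetical normal data):

```r
# #Simulated Type I error rate: H0 (mu = 0) is true, alpha = 0.05
set.seed(42)
rejections <- replicate(10000, t.test(rnorm(30), mu = 0)$p.value < 0.05)
mean(rejections)   # close to 0.05 by construction
```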

4.17 Critical Value

Left Tail vs. Right Tail

Figure 4.8 Left Tail vs. Right Tail

Two Tail

Figure 4.9 Two Tail

24.18 Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).

# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
#
# #Left Tail Test
round(qnorm(p = xxalpha, lower.tail = TRUE), 4)
##     10%      5%    5/2%      1%    1/2% 
## -1.2816 -1.6449 -1.9600 -2.3263 -2.5758
#
# #Right Tail Test
round(qnorm(p = xxalpha, lower.tail = FALSE), 4)
##    10%     5%   5/2%     1%   1/2% 
## 1.2816 1.6449 1.9600 2.3263 2.5758

24.16 A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).

24.19 An acceptance region (confidence interval) is a set of values for the test statistic for which the null hypothesis is accepted, i.e. if the observed test statistic is in the confidence interval then we accept the null hypothesis and reject the alternative hypothesis.

24.21 A rejection region (critical region), is a set of values for the test statistic for which the null hypothesis is rejected. i.e. if the observed test statistic is in the critical region then we reject the null hypothesis and accept the alternative hypothesis.

4.18 Tailed Tests

24.12 A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.

One tailed-tests are concerned with one side of a statistic. Whereas, Two-tailed tests deal with both tails of the distribution.

Two-tail test is done when you do not know about direction, so you test for both sides.

24.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)

24.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

24.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

4.19 Approaches

24.15 The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.

Steps for the p-value approach or test statistic approach

  • Calculate \(z\) for given \(x\): \(z = \frac{\overline{x} - \mu_0}{s}\)
  • Refer Get P(z) by pnorm() or z by qnorm(), to get the p-value from z-table
    • \(P_{\left(\overline{x}\right)} = P_{\left(z\right)}\)
  • Compare p-value with Level of significance \({\alpha}\)

24.17 The critical value approach requires that we first determine a value for the test statistic called the critical value.

Steps for the critical value approach

  • Calculate \(z\) for given \(x\): \(z = \frac{\overline{x} - \mu_0}{s}\)
  • Using the z-table, find the z for given Level of significance \({\alpha} = 0.01\)
  • Compare test statistic with z-value i.e. \((z)\) vs. \((z_{\alpha = 0.01})\)
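The two approaches can be worked side by side with hypothetical numbers (\(\overline{x} = 14.5\), \(\mu_0 = 14\), \(\sigma = 2\), \(n = 64\), upper-tail test at \(\alpha = 0.01\)):

```r
# #p-value approach vs critical value approach (hypothetical numbers, upper tail)
xbar <- 14.5; mu0 <- 14; sigma <- 2; n <- 64; alpha <- 0.01
z <- (xbar - mu0) / (sigma / sqrt(n))        # test statistic = 2
p_value <- pnorm(z, lower.tail = FALSE)      # ~0.0228
z_crit  <- qnorm(alpha, lower.tail = FALSE)  # 2.3263
c(reject_by_p = p_value < alpha, reject_by_crit = z > z_crit)
```

Both entries come out FALSE here, so the two approaches agree: fail to reject \({H_0}\) at \(\alpha = 0.01\).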

4.20 z-test vs. t-test

If the population standard deviation \({\sigma}\) is known, apply the z-test. If it is unknown, apply the t-test. The t-test converges to the z-test with increasing sample size.

Question: Does the probability from the t-table differ from the probability from the z-table

  • Practically no; for sample sizes greater than 30 there is no meaningful difference

It is assumed that \((\overline{x} - \mu)\) follows normality. However, the Standard Error (SE) does not follow normality; the sample variance generally follows a (scaled) chi-squared distribution. Thus \((\overline{x} - \mu)/SE\) becomes a ‘Normal/ChiSq’ ratio, and this ratio follows the t-distribution. Thus, the test we apply is called the t-test.

# #For Degrees of Freedom = 10 (N=11)
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
dof <- 10L
#
# #Left Tail Test
round(qt(p = xxalpha, df = dof, lower.tail = TRUE), 4)
##     10%      5%    5/2%      1%    1/2% 
## -1.3722 -1.8125 -2.2281 -2.7638 -3.1693
#
# #Right Tail Test
round(qt(p = xxalpha, df = dof, lower.tail = FALSE), 4)
##    10%     5%   5/2%     1%   1/2% 
## 1.3722 1.8125 2.2281 2.7638 3.1693
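The convergence of the t critical values to the z critical value can be seen directly by increasing the degrees of freedom:

```r
# #t critical values approach the z critical value as df grows (right tail, alpha = 0.025)
round(sapply(c(5L, 10L, 30L, 100L, 1000L),
             function(df) qt(0.025, df = df, lower.tail = FALSE)), 4)
round(qnorm(0.025, lower.tail = FALSE), 4)   # 1.96
```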

4.21 t-test

4.21.1 Degrees of Freedom

23.5 The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, the degrees of freedom of an estimate of a parameter are \((n - 1)\).

Why \((n-1)\) are the degrees of freedom

  • Degrees of freedom refer to the number of independent pieces of information that go into the computation. i.e. \(\{(x_{1}-\overline{x}), (x_{2}-\overline{x}), \ldots, (x_{n}-\overline{x})\}\)
  • However, \(\sum (x_{i}-\overline{x}) = 0\) for any data set.
  • Thus, only \((n − 1)\) of the \((x_{i}-\overline{x})\) values are independent.
    • if we know \((n − 1)\) of the values, the remaining value can be determined exactly by using the condition.
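The zero-sum constraint, and the fact that the last deviation is determined by the first \((n - 1)\), can be verified on any vector:

```r
# #Deviations from the mean sum to zero; the last one is fixed by the rest
x <- c(4, 8, 15, 16, 23, 42)
d <- x - mean(x)
sum(d)                         # 0
all.equal(d[6], -sum(d[1:5]))  # TRUE: d[6] is determined by d[1..5]
```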

Question: Is there any minimum sample size we must consider before calculating degrees of freedom

  • Larger sample sizes are needed if the distribution of the population is highly skewed or includes outliers.

Guess: Degrees of freedom is also calculated to remove the possible bias

  • No

4.21.2 How to use t-table

  • Rows have degrees of freedom, Columns have \({\alpha}\) values, get the t-statistic at their intersection
    • For DOF = 10, and \({\alpha} = 0.05\), t-table has value 1.812 (Critical Limit)
    • In right tail test, if the test-statistic is greater than critical limit, we can reject the null

Validation


5 Statistics (B13, Oct-03)

5.1 Overview

  • “Introduction to Statistics”

5.2 Definitions

Type-I $(\alpha)$ and Type-II $(\beta)$ Errors

Figure 5.1 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors

Refer Type I and Type II Errors (B12) & Type I and Type II Errors

24.7 The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).

24.8 The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).

24.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.

23.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

24.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).

24.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.

24.23 p-value Approach: Form Hypothesis | Specify \({\alpha}\) | Calculate test statistic | Calculate p-value | Compare p-value with \({\alpha}\) | Interpret

22.13 In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’

5.3 Approaches

Population Size = 100, \({\alpha} = 0.05\)

Hypothesis: \(\text{\{Right Tail or Upper Tail\} } {H_0} : {\mu} \leq 22 \iff {H_a}: {\mu} > 22\)

Sample: n=4, dof = 3, \({\overline{x}} = 23\)

Sample: n=10, dof = 9, \({\overline{x}} = 23\)

We know that if we take another sample, we will have a different sample mean. So we need to confirm whether the sample mean \({\overline{x}} = 23\) calculated above represents the population mean \({\mu}\), i.e. can we reject or fail to reject \({H_0}\) based on this sample!

3 Approaches for Hypothesis Testing -

  1. Test Statistic Approach
    • Fictitious values: Standard Error (SE) = 0.22, so \(t = \frac{23 - 22}{0.22} = 4.545\)
    • For (DOF = 3): \(P_{(t)} = {\alpha} = 0.05\), at \({}^{3}t_{\alpha} = 2.353\)
    • For (DOF = 9): \(P_{(t)} = {\alpha} = 0.05\), at \({}^{9}t_{\alpha} = 1.833\)
    • For both the cases, \({t}\) is greater than \({}^{dof}t_{\alpha}\)
    • Hence null is rejected, the ‘test is statistically significant’
  2. p-value approach
    • Fictitious values: Standard Error (SE) = 0.22, so \(t = \frac{23 - 22}{0.22} = 4.545\)
    • Get \({}^3\!P_{(t = 4.545)} = 0.00997\)
    • Get \({}^9\!P_{(t = 4.545)} = 0.000697\)
    • For both the cases, \(P_{(t)}\) is lower than \({\alpha}\)
    • Hence null is rejected, the ‘test is statistically significant’
  3. Confidence Interval Approach
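The fictitious values above can be reproduced with qt() and pt():

```r
# #Reproduce the fictitious example: t = (23 - 22)/0.22, critical values and p-values
t_stat <- (23 - 22) / 0.22
round(t_stat, 3)                                    # 4.545
round(qt(0.05, df = 3, lower.tail = FALSE), 3)      # 2.353
round(qt(0.05, df = 9, lower.tail = FALSE), 3)      # 1.833
signif(pt(t_stat, df = 3, lower.tail = FALSE), 3)   # ~0.00997
signif(pt(t_stat, df = 9, lower.tail = FALSE), 3)   # ~0.000697
```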

If the population standard deviation \({\sigma}\) is known, apply the z-test. If it is unknown, apply the t-test. The t-test converges to the z-test with increasing sample size.

2-T Rule of Thumb - Skipped “09:55”


Examples

Example:

  1. Question: If we get a z-value of 3.44 (Right Tail), What is the Probability \(P_{(z)}\)
    • For z = 3.44 & Left Tail, p-value = 0.999709 (by ‘pnorm(z)‘)
    • For z = 3.44 & Right Tail, p-value = 2.91e-04 (by ‘1 - pnorm(z)‘)
    • For z = 3.44 & Right Tail, p-value = 2.91e-04 (by ‘pnorm(z, lower.tail = FALSE)‘)
  2. Question: If we get a z-value of 4.55 (Right Tail), What is the Probability \(P_{(z)}\)
    • For z = 4.55 & Right Tail, p-value = 0.00000268
  3. Question: If we get a z-value of 1.22 (Right Tail), would we reject the null at \({\alpha} = 0.05\)
    • For z = 1.22 & Right Tail, p-value = 0.11123
    • Because \(P_{(z)}\) is greater than the \({\alpha}\), we fail to reject the null, the ‘test is statistically NOT significant’
  4. Question: If we get a z-value of 1.99 (Right Tail), would we reject the null at \({\alpha} = 0.05\)
    • For z = 1.99 & Right Tail, p-value = 0.023295 ( = 1 - 0.9767)
    • Because \(P_{(z)}\) is lower than the \({\alpha}\), null is rejected, the ‘test is statistically significant’

Code

# #Get P(z)
z01 <- round(pnorm(3.44), digits = 6)
z02 <- 1 - round(pnorm(3.44), digits = 6)
z03 <- round(pnorm(3.44, lower.tail = FALSE), digits = 6)
z04 <- format(pnorm(4.55, lower.tail = FALSE), digits = 3, scientific = FALSE)
z05 <- format(pnorm(1.22, lower.tail = FALSE), digits = 5)
z06 <- format(pnorm(1.99, lower.tail = FALSE), digits = 5)
z07 <- format(pnorm(1.99, lower.tail = TRUE), digits = 5)

5.4 Flowchart

  • Tests |
    • Test of Means |
      • One Sample |
        • z-test (Population Standard Deviation \({\sigma}\), is known)
        • t-test (Population Standard Deviation \({\sigma}\), is unknown)
      • Two Sample |
      • More than Two Samples

5.5 Two Sample t-Test

25.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)

25.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

25.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

Example:

25.6 Independent sample design: A simple random sample of workers is selected and each worker in the sample uses method 1. A second independent simple random sample of workers is selected and each worker in this sample uses method 2.

25.7 Matched sample design: One simple random sample of workers is selected. Each worker first uses one method and then uses the other method. The order of the two methods is assigned randomly to the workers, with some workers performing method 1 first and others performing method 2 first. Each worker provides a pair of data values, one value for method 1 and another value for method 2.

Test Statistic for Independent Sample t-Test Statistic is given by (25.9) as shown below

\[t = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}} = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}}}\]

The t-test is any statistical hypothesis test in which the test statistic follows a Student t-distribution under the null hypothesis.

A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known.
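The statistic in equation (25.9) is exactly what t.test() computes by default (the Welch form, with unpooled variances); a sketch with hypothetical data:

```r
# #Manual two-sample t statistic (D0 = 0) vs t.test() default (hypothetical data)
set.seed(4)
x1 <- rnorm(12, mean = 50, sd = 5)
x2 <- rnorm(15, mean = 46, sd = 6)
t_manual <- (mean(x1) - mean(x2) - 0) / sqrt(var(x1) / 12 + var(x2) / 15)
t_fn <- unname(t.test(x1, x2)$statistic)   # var.equal = FALSE (Welch) by default
all.equal(t_manual, t_fn)                  # TRUE
```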

Example: If we want to evaluate effect of a Training Program.

We can take two samples of 50 people each. First Set “Untrained” would be from the set of people who did not receive training. Second Set “Trained” would be from the set of people who have undergone the training. Comparison of these two sample mean performances would be done by “independent sample” t-test.

Or

We can take a sample of 50 “Untrained” people. Get their mean performance. Provide the training to these 50 people. Then again get their mean performance. Now we have “paired” samples of performances of the same people: one set has their performance before the training and the other after the training. Comparison of these two sample mean performances would be done by a “paired sample” t-test.

Paired samples t-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice (a “repeated measures” t-test).

The matched sample design is generally preferred to the independent sample design because the matched-sample procedure often improves the precision of the estimate.

  • Question: How do you conclude that the training was effective, or even that the training did not make the situation worse somehow. Assume 25 people performed better and 25 people performed worse (somehow!)
    • (Prof) In this situation we cannot claim that the training is effective.
      • We start with the null hypothesis that the two sample means are the same (Two Tail Test). If we are able to reject this, then we can perform an Upper or Lower Tail Test and again try to find a significant result.
    • If we cannot reject the null hypothesis, then we conclude that “we cannot claim that the training is effective.”
  • Example: If we want to comment on the performance of new employees compared to an ideal value of 95, take a sample of 20 people and get their performance.
    • We would need to perform a One Sample t-test
  • Example: If we want to comment on performance of Engineers and Non-engineers
    • Two Sample Independent t-test
  • Example: We want to check whether we have recruited more females than males.
    • Two Sample Proportion Test
  • Example: If we want to comment on their induction training program by conducting a test before and after the program
    • Two Sample Paired t-test
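
The scenarios above can be sketched with t.test() on simulated data; all the distributional parameters below are illustrative assumptions, not values from the case.

```r
# Minimal sketch of the t-test variants on simulated performance scores;
# the means, sds, and sample sizes below are illustrative assumptions
set.seed(42)
untrained <- rnorm(50, mean = 70, sd = 10)            # independent "Untrained" group
trained   <- rnorm(50, mean = 73, sd = 10)            # independent "Trained" group
after     <- untrained + rnorm(50, mean = 3, sd = 5)  # same 50 people, post-training

r1 <- t.test(trained, mu = 95)                 # one-sample: compare against an ideal value
r2 <- t.test(trained, untrained)               # two-sample independent
r3 <- t.test(after, untrained, paired = TRUE)  # paired: same people, before vs after
```

The paired variant uses the within-person differences, which is why the matched design often improves precision.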

5.6 More than Two Samples

Assume there are 3 samples A, B, C. We can perform \(C_2^3 = 3\) pairwise tests, i.e. \(\{(A, B), (B, C), (C, A)\}\). However, assuming \({\alpha} = 0.05 \iff {\gamma} = 0.95\) for each test, the confidence for 3 consecutive tests becomes \({\gamma}^3 = 0.857 \iff {\alpha} = 0.143\), which is a very high and unacceptable error rate. To avoid this, we use ANOVA as a single test.
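
The inflation of the error rate can be verified directly:

```r
# Family-wise error when running k = 3 pairwise tests, each at alpha = 0.05
alpha <- 0.05
k <- 3
gamma_k  <- (1 - alpha)^k   # joint confidence: 0.95^3
alpha_fw <- 1 - gamma_k     # family-wise alpha
round(c(confidence = gamma_k, alpha = alpha_fw), 3)   # 0.857 and 0.143, as above
```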

A high value of the F-statistic would indicate that the populations are different.

Validation


6 Statistics (B14, Oct-10)

6.1 Overview

Equality in Hypothesis

  • The equality part of the expression \(\{\mu \geq \mu_0 \, | \, \mu \leq \mu_0 \, | \, \mu = \mu_0\}\) always appears in the null hypothesis \({H_0}\).
    • We try to reject null, so that we can confidently accept the alternate. If the alternate is ambiguous e.g. “is greater than or equal to” then we will not be able to conclude with confidence.
  • Alternative hypothesis is often what the test is attempting to establish.
    • Hence, asking whether the user is looking for evidence to support \(\{\mu < \mu_0 \, | \, \mu > \mu_0 \, | \, \mu \neq \mu_0\}\) will help determine \({H_a}\)

24.14 If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (24.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)

24.24 If \({\sigma}\) is unknown, the sampling distribution of the test statistic follows the t distribution with \((n - 1)\) degrees of freedom. Refer equation (24.3) \(\to t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}}\)
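
Both statistics can be computed by hand and checked against t.test(); the sample below is simulated, and \({\mu}_0\) and \({\sigma}\) are assumed values for illustration only.

```r
# Hand computation of equations (24.1) and (24.3) on a simulated sample;
# mu0 and sigma are assumed values, not from any case data
set.seed(7)
x     <- rnorm(36, mean = 10.5, sd = 2)
mu0   <- 10
sigma <- 2
n     <- length(x)

z <- (mean(x) - mu0) / (sigma / sqrt(n))  # sigma known   -> z statistic (24.1)
t <- (mean(x) - mu0) / (sd(x)  / sqrt(n)) # sigma unknown -> t statistic (24.3)

p_z <- pnorm(z, lower.tail = FALSE)           # upper-tail p-value for z
p_t <- pt(t, df = n - 1, lower.tail = FALSE)  # upper-tail p-value for t
```

The t statistic matches the one reported by t.test(x, mu = 10, alternative = "greater").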

6.2 Example: WSES: Preprocessing

Please import the WSES data into the xxWSES object. Due to copyright, it has not been shared.

  • Assuming: Average Sales 8-million \(({\mu}_0 = 8)\), Standard Deviation 2-million \(({\sigma} = 2)\)
  • Hypothesis test to check whether the average sales value in the population is at least 8-million

24.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

Caution: While importing data, Mac users will probably find it easier to import the CSV file. However, I am not a Mac user so cannot comment on this.

Data

# #Import the Data and assign to a temporary variable for ease of use
xxWSES <- f_getRDS(xxWSES)
bb <- xxWSES
str(bb)
## spec_tbl_df [1,000 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Opportunity No.                 : num [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Reporting Status                : chr [1:1000] "Lost" "Won" "Lost" "Won" ...
##  $ Sales Outcome                   : num [1:1000] 0 1 0 1 1 0 0 0 0 0 ...
##  $ Product                         : chr [1:1000] "LearnSys" "GTMSys" "GTMSys" "GTMSys" ...
##  $ Industry                        : chr [1:1000] "Banks" "Airline" "Capital Markets" "Insurance" ...
##  $ Region                          : chr [1:1000] "Africa" "UK" "UK" "UK" ...
##  $ Relative Strength in the segment: num [1:1000] 45 56 48 58 49 64 54 54 67 54 ...
##  $ Profit of Customer in Million   : num [1:1000] 2.11 0.79 1.62 0.09 1.46 0.94 1.32 1.09 1.3 1.32 ...
##  $ Sales Value in Million          : num [1:1000] 10.29 11.42 5.63 10.17 10.6 ...
##  $ Profit %                        : num [1:1000] 29 46 70 46 32 65 50 74 46 57 ...
##  $ WSES Proportion in Joint Bid    : num [1:1000] 66 50 50 56 54 52 68 48 56 63 ...
##  $ Leads Conversion Class          : chr [1:1000] "F" "E" "F" "E" ...

Rename Headers

# #List Column Headers
names(bb)
##  [1] "Opportunity No."                  "Reporting Status"                
##  [3] "Sales Outcome"                    "Product"                         
##  [5] "Industry"                         "Region"                          
##  [7] "Relative Strength in the segment" "Profit of Customer in Million"   
##  [9] "Sales Value in Million"           "Profit %"                        
## [11] "WSES Proportion in Joint Bid"     "Leads Conversion Class"
#
# #Rename Headers
bb_headers <- c("SN" , "RS" , "SO" , "PDT" , "INT" , "RG" , "RS1" , "PM" , "SVM" , "PP" , "JB" , "LCC")
names(bb) <- bb_headers
#
# #Verification
names(bb)
##  [1] "SN"  "RS"  "SO"  "PDT" "INT" "RG"  "RS1" "PM"  "SVM" "PP"  "JB"  "LCC"
#

Conversion to Factor

  • From the case study, it can be seen that multiple columns are categorical (factor) or ordinal (ordered factor)

  • Question: What is the importance of having this kind of order factor over simple factor

    • Here order is also important. Also in future, ordered factors will be needed for some analysis
    • (Aside) Refer Scales of Measurement
      • Simple factor (nominal) can provide only Mode whereas Ordered factor (ordinal) can provide Median also. Rank based statistical models can be applied on the ordinal data.
  • Question: If ‘RS’ is already integer 0 & 1, then why convert it to factor

    • While the data shows them as 0 & 1, they are actually NOT integers
    • If we show male and female as 0 & 1, these are still categorical
  • Question: Why LCC is NOT ordinal (Aside)

    • Unknown “ForLater”
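
A minimal sketch of the nominal-vs-ordinal distinction; the size labels below are hypothetical, not from the case data.

```r
# Sketch: nominal vs ordinal factors
sizes <- c("S", "L", "M", "S", "L")

nom <- factor(sizes)                                             # nominal
ord <- factor(sizes, levels = c("S", "M", "L"), ordered = TRUE)  # ordinal

ord[1] < ord[2]    # TRUE: order comparisons are meaningful for ordinal data
# nom[1] < nom[2]  # would give NA with a warning: '<' not meaningful for factors

# The ordinal scale supports a rank-based median (quantile types 1 and 3)
quantile(ord, probs = 0.5, type = 1)   # "M"
```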

Data

# #"Reporting Status i.e. RS" Converting "character" to "factor" and Label them
bb$RS <- factor(bb$RS, levels = c("Lost", "Won"), labels = c("0", "1"))
#
# #"Sales Outcome i.e. SO" Converting "numeric" to "factor" 
bb$SO <- factor(bb$SO)
#
# #"Product Vertical i.e. PDT" Ordinal
# #What are the unique values in this column
unique(bb$PDT)
## [1] "LearnSys"   "GTMSys"     "Lifesys"    "Finsys"     "Procsys"    "Logissys"   "ContactSys"
#
# #Converting "character" to "Ordered factor"
# #Note: If level order is not provided, by default, alphabetical ordering will be assigned.
levels(factor(bb$PDT, ordered = TRUE))
## [1] "ContactSys" "Finsys"     "GTMSys"     "LearnSys"   "Lifesys"    "Logissys"   "Procsys"
#
# #Provide ordering of factor levels in Ascending Order.
bb$PDT <- factor(bb$PDT, ordered = TRUE, 
      levels = c("GTMSys", "Procsys", "LearnSys", "Finsys", "Lifesys", "Logissys", "ContactSys"))
#
# #"Industry i.e. INT" Ordinal
bb$INT <- factor(bb$INT, ordered = TRUE, 
      levels = c("Capital Markets", "Banks", "Defense", "Consumer goods", "Others", "Security", 
        "Energy", "Insurance", "Airline", "Finance", "Infrastructure", "Mobility", "Other Govt.", 
        "Govt.", "Telecom equipments", "Health", "Clinical research", "Agriculture"))
#
# #"Region i.e. RG" Ordinal
bb$RG <- factor(bb$RG, ordered = TRUE, levels = c("UK", "Other Europe", "Americas", "Africa",
                                                "India", "Japan", "Singapore", "Spain", "Canada"))
#
# #"Leads Conversion Class i.e. LCC" Ordinal, However we are going with Nominal here.
bb$LCC <- factor(bb$LCC, levels = c("E", "V", "F", "L"), labels = c(1, 2, 3, 4))

Structure after Conversion

str(bb)
## spec_tbl_df [1,000 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ SN : num [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
##  $ RS : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
##  $ SO : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
##  $ PDT: Ord.factor w/ 7 levels "GTMSys"<"Procsys"<..: 3 1 1 1 1 5 4 4 1 4 ...
##  $ INT: Ord.factor w/ 18 levels "Capital Markets"<..: 2 9 1 8 5 5 4 4 16 4 ...
##  $ RG : Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 4 1 1 1 1 2 1 1 1 1 ...
##  $ RS1: num [1:1000] 45 56 48 58 49 64 54 54 67 54 ...
##  $ PM : num [1:1000] 2.11 0.79 1.62 0.09 1.46 0.94 1.32 1.09 1.3 1.32 ...
##  $ SVM: num [1:1000] 10.29 11.42 5.63 10.17 10.6 ...
##  $ PP : num [1:1000] 29 46 70 46 32 65 50 74 46 57 ...
##  $ JB : num [1:1000] 66 50 50 56 54 52 68 48 56 63 ...
##  $ LCC: Factor w/ 4 levels "1","2","3","4": 3 1 3 1 2 3 2 2 4 4 ...

factor()

  • factor()
    • If level order is not provided, by default, alphabetical ordering will be assigned.
    • levels are the input, labels are the output in the factor() function.
    • A factor has only a level attribute, which is set by the labels argument in the factor() function.
    • different levels are coded as (“E”, “V”, “F”, “L”)
      • for which you want the levels to be labeled as c(1, 2, 3, 4).
      • The factor() function will look for the values (“E”, “V”, “F”, “L”), convert them to numerical factor codes and add the label values to the ‘levels’ attribute of the factor.
      • This attribute is used to convert the internal numerical codes to the correct labels.
      • However, there is no ‘label’ attribute.
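
The mapping described above can be seen directly:

```r
# How factor() maps raw values to internal codes and labels
x <- c("E", "V", "F", "L", "E")
f <- factor(x, levels = c("E", "V", "F", "L"), labels = c(1, 2, 3, 4))

levels(f)            # "1" "2" "3" "4": the labels become the 'levels' attribute
as.integer(f)        # 1 2 3 4 1: the internal numeric codes
names(attributes(f)) # only "levels" and "class"; there is no 'label' attribute
```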

6.2.1 Conversion to Numeric

# #If there are "character" columns which should be "numeric"
bb$RS1 <- as.numeric(bb$RS1)
bb$PP <- as.numeric(bb$PP)
bb$JB <- as.numeric(bb$JB)
# #Equivalent
bb <- bb %>% mutate(across(c(RS1, PP, JB), as.numeric))

6.3 WSES: Analysis

6.3.1 Q1

Assume average sales of 8-million dollars and population standard deviation to be 2-million dollars. Check whether the average sales value in the population is at least 8 million dollars.

24.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

  • Mean of “Sales Value in Million i.e. SVM”
    • \({\overline{x}} = 8.0442\)
      • mean(bb$SVM) \(\#\mathcal{R}\)
  • \(n = 1000, {\mu} = 8, {\sigma} = 2\)
  • \(z = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}} = 0.6988\)
    • {8.0442 - 8} / {2 / sqrt(1000)} \(\#\mathcal{R}\)
    • {mean(bb$SVM) - 8} / {2 / sqrt(1000)} \(\#\mathcal{R}\)
  • \({}^U\!P_{z = 0.6988} = 0.2423\)
    • pnorm(q = 0.6988, lower.tail = FALSE) \(\#\mathcal{R}\)
    • 1 - pnorm(q = 0.6988, lower.tail = TRUE) \(\#\mathcal{R}\)
    • 1 - pnorm(q = 0.6988) \(\#\mathcal{R}\)
    • pnorm(q = -0.6988) \(\#\mathcal{R}\)
    • Caution: By default, pnorm() provides the probability to the left of z-value
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(z)} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude that the average sales is ‘at least’ 8-million
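
The steps above can be collected into one runnable block, using the quoted sample mean (8.0442) rather than the original data, which has not been shared:

```r
# Upper-tail z-test using the summary figures quoted above
xbar <- 8.0442; mu0 <- 8; sigma <- 2; n <- 1000

z <- (xbar - mu0) / (sigma / sqrt(n))
p <- pnorm(z, lower.tail = FALSE)

round(c(z = z, p = p), 4)                        # z ~ 0.6988, p ~ 0.2423
if (p < 0.05) "Reject H0" else "Cannot reject H0"
```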

6.3.2 Q2

If the population standard deviation is unknown, evaluate the same hypothesis again.

  • \(t = \frac{\overline{x} - {\mu}_0}{{s}/\sqrt{n}} = 0.7049\)
    • {mean(bb$SVM) - 8} / {sd(bb$SVM) / sqrt(1000)} \(\#\mathcal{R}\)
  • \({}^U\!P_{t = 0.7049} = 0.2405\)
    • pt(q = 0.7049, df = nrow(bb) - 1, lower.tail = FALSE) \(\#\mathcal{R}\)
  • t.test()
    • t.test(bb$SVM, mu = 8, alternative = "greater") \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(t)} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude that the average sales is ‘at least’ 8-million (same as earlier)
  • Question: “alternative hypothesis: true mean is greater than 8” What is the meaning of this line in the output of t.test
    • We provided this information.
    • (Aside) This line is providing the Alternate Hypothesis for reference. \({\mu}_0 = 8\) and we are performing an upper tail test thus the Alternate Hypothesis is “true mean is greater than 8”

6.3.3 Q3

Check whether the proportion of leads won by WSES is more than 50%.

24.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)

  • Count of Success \(({x})\) is Winning leads in the “Sales Outcome i.e. SO”
  • \({p}_0 = 0.50\)
    • (24.4) \({\sigma}_{\overline{p}} = \sqrt{\frac{{p}_0 (1 - {p}_0)}{n}} = 0.0158\)
      • sqrt(0.50 * {1 - 0.50} / 1000) \(\#\mathcal{R}\)
  • \(\{n = 1000, x = 481\} \to {\overline{p}} = {x}/{n} = 0.481\)
  • (24.5) \(z = \frac{{\overline{p}} - {p}_0}{{\sigma}_{\overline{p}}} = -1.2016\)
    • {0.481 - 0.50}/{sqrt(0.50 * {1 - 0.50} / 1000)} \(\#\mathcal{R}\)
  • \({}^U\!P_{z = -1.2016} = 0.8852\)
    • pnorm(q = -1.2016, lower.tail = FALSE) \(\#\mathcal{R}\)
    • 1 - pnorm(q = -1.2016) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(z)} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude that the company wins ‘at least’ 50% leads
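
The same hypothesis can also be run through prop.test(); with correct = FALSE its p-value matches the hand computation above.

```r
# One-sample proportion test: x = 481 wins out of n = 1000 leads (counts from above)
res <- prop.test(x = 481, n = 1000, p = 0.50,
                 alternative = "greater", correct = FALSE)
res$p.value   # ~0.8852, matching the hand computation

# Hand computation, as in (24.4) and (24.5)
p_bar <- 481 / 1000
z <- (p_bar - 0.50) / sqrt(0.50 * (1 - 0.50) / 1000)
pnorm(z, lower.tail = FALSE)
```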

Code

# #Proportions
bb %>% group_by(SO) %>% summarise(PCT = n() / nrow(.))
## # A tibble: 2 x 2
##   SO      PCT
##   <fct> <dbl>
## 1 0     0.519
## 2 1     0.481
#
pnorm(q = -1.2016, lower.tail = FALSE)
## [1] 0.8852407

Percentage

# #Grouped Percentages
# #table() gives a count
table(bb$SO)
## 
##   0   1 
## 519 481
#
# #prop.table() can work only with numbers so it needs table()
prop.table(table(bb$SO))
## 
##     0     1 
## 0.519 0.481
#
# #Similar
bb %>% group_by(SO) %>% summarise(N = n()) %>% mutate(PCT = N / sum(N))
## # A tibble: 2 x 3
##   SO        N   PCT
##   <fct> <int> <dbl>
## 1 0       519 0.519
## 2 1       481 0.481
bb %>% group_by(SO) %>% summarise(PCT = n() / nrow(.))
## # A tibble: 2 x 2
##   SO      PCT
##   <fct> <dbl>
## 1 0     0.519
## 2 1     0.481

6.3.4 Q4

Check whether the probability of winning a sales lead for the product “LearnSys” is more than that of “Finsys.”

25.10 \(\text{\{Right or Upper\} } {H_0} : {p}_1 - {p}_2 \leq 0 \iff {H_a}: {p}_1 - {p}_2 > 0\)

  • Count of Success \(({x})\) is Winning leads in the “Sales Outcome i.e. SO”
  • (1: LearnSys) \(\{{n}_1 = 55 + 71 = 126, {x}_1 = 71\} \to {\overline{p}}_1 = {x}_1/{n}_1 = 0.563\)
  • (2: Finsys) \(\{{n}_2 = 83 + 34 = 117, {x}_2 = 34\} \to {\overline{p}}_2 = {x}_2/{n}_2 = 0.291\)
  • \({}^U\!P_{z} < {\alpha} \to {H_0}\) is rejected i.e. the proportions are different
    • We can conclude that “LearnSys” performs better than “Finsys”
  • Question: When we have filtered out the data, why does the table show 0 values
    • We can filter them out, however these are in memory
    • (Aside) Factor levels are not dropped, by default, when you filter them. Use factor() again to drop the unused levels.
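
A minimal sketch of this behaviour, using droplevels() as an alternative to re-applying factor(); the vector below is a toy example.

```r
# Filtering (subsetting) keeps all original factor levels in memory
f   <- factor(c("LearnSys", "Finsys", "GTMSys"))
sub <- f[f %in% c("LearnSys", "Finsys")]

levels(sub)               # still 3 levels; GTMSys would tabulate as 0
levels(droplevels(sub))   # unused level removed
# factor(sub) achieves the same as droplevels(sub) here
```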

Test

# #Data | Subset | Filter | Update Factor levels
ii <- bb %>% select(PDT, SO, SVM) %>% 
  filter(PDT %in% c("LearnSys", "Finsys")) %>% mutate(across(PDT, factor))
str(ii)
## tibble [243 x 3] (S3: tbl_df/tbl/data.frame)
##  $ PDT: Ord.factor w/ 2 levels "LearnSys"<"Finsys": 1 2 2 2 1 2 2 1 2 2 ...
##  $ SO : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
##  $ SVM: num [1:243] 10.29 5.25 7.3 9.69 6.82 ...
#
# #Count
table(ii$PDT, ii$SO)
##           
##             0  1
##   LearnSys 55 71
##   Finsys   83 34
#
# #Proportion Table
round(prop.table(table(ii$PDT, ii$SO), margin = 1), 3)
##           
##                0     1
##   LearnSys 0.437 0.563
##   Finsys   0.709 0.291
#
prop.test(x = c(71, 34), n = c(126, 117), alternative = "greater")
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(71, 34) out of c(126, 117)
## X-squared = 17.316, df = 1, p-value = 1.583e-05
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.1644089 1.0000000
## sample estimates:
##    prop 1    prop 2 
## 0.5634921 0.2905983

prop.table()

ii <- bb %>% select(PDT, SO, SVM) %>% 
  filter(PDT %in% c("LearnSys", "Finsys")) %>% mutate(across(PDT, factor))
str(ii)
## tibble [243 x 3] (S3: tbl_df/tbl/data.frame)
##  $ PDT: Ord.factor w/ 2 levels "LearnSys"<"Finsys": 1 2 2 2 1 2 2 1 2 2 ...
##  $ SO : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
##  $ SVM: num [1:243] 10.29 5.25 7.3 9.69 6.82 ...
#
# #Proportion Table: margin gives the margin to split by i.e. 
# #1 means rowwise sum, 2 means columnwise
round(prop.table(table(ii$PDT, ii$SO), margin = 1), 3)
##           
##                0     1
##   LearnSys 0.437 0.563
##   Finsys   0.709 0.291
round(prop.table(table(ii$PDT, ii$SO), margin = 2), 3)
##           
##                0     1
##   LearnSys 0.399 0.676
##   Finsys   0.601 0.324
#
# #Similar
ii %>% select(PDT, SO) %>% 
  count(PDT, SO) %>% pivot_wider(names_from = SO, values_from = n) %>% 
  mutate(SUM = rowSums(across(where(is.numeric)))) %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1))) 
## # A tibble: 2 x 4
##   PDT        `0`   `1`   SUM
##   <ord>    <dbl> <dbl> <dbl>
## 1 LearnSys  43.7  56.3   100
## 2 Finsys    70.9  29.1   100

6.3.5 Q5

Check whether the average sales value of “LearnSys” projects is higher than that of “Finsys” projects. (\({\alpha} = 0.05\))

25.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

  • The (pooled) t-test for the difference of means can be applied only if the variances are the same.
    • The locations of two distributions can be compared only when their spreads are similar.
    • (Aside) If the variances are not the same, then the Welch test is applied in place of the pooled test.
  • \({}^U\!P_{t} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude that “LearnSys” has a higher average sales value than “Finsys”
  • Question: What is ‘SVM ~ PDT’
    • It is the formula notation for t-test (in R) in the form of ‘Continuous ~ Categorical’

“ForLater”

  • Question: How to decide whether to keep the Lost bids or exclude them
    • Including lost bids: \(\text{(DOF)} = 241, t = 0.93503, P_{(t)} = 0.1754 \to {H_0}\) cannot be rejected
    • Excluding lost bids: \(\text{(DOF)} = 103, t = 0.62152, P_{(t)} = 0.2678 \to {H_0}\) cannot be rejected
    • Thus, there is no change in the hypothesis results, but which process should be utilised?
    • A similar question can be raised for Q6, i.e. should we consider the profit of those bids which we lost?
str(ii)
## tibble [243 x 3] (S3: tbl_df/tbl/data.frame)
##  $ PDT: Ord.factor w/ 2 levels "LearnSys"<"Finsys": 1 2 2 2 1 2 2 1 2 2 ...
##  $ SO : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
##  $ SVM: num [1:243] 10.29 5.25 7.3 9.69 6.82 ...
ii_var <- var.test(SVM ~ PDT, data = ii)
if(ii_var$p.value > 0.05) {
  cat(paste0("Variances are same. Pooled Test can be applied. \n"))
  t.test(formula = SVM ~ PDT, data = ii, alternative = "greater", var.equal = TRUE)
} else {
  cat(paste0("Variances are NOT same. Welch Test should be applied. \n"))
  t.test(formula = SVM ~ PDT, data = ii, alternative = "greater", var.equal = FALSE)
}
## Variances are same. Pooled Test can be applied.
## 
##  Two Sample t-test
## 
## data:  SVM by PDT
## t = 0.93503, df = 241, p-value = 0.1754
## alternative hypothesis: true difference in means between group LearnSys and group Finsys is greater than 0
## 95 percent confidence interval:
##  -0.1728648        Inf
## sample estimates:
## mean in group LearnSys   mean in group Finsys 
##               8.030476               7.804786
jj <- ii %>% filter(SO == "1")
jj_var <- var.test(SVM ~ PDT, data = jj)
if(jj_var$p.value > 0.05) {
  t.test(formula = SVM ~ PDT, data = jj, alternative = "greater", var.equal = TRUE)
} else {
  cat(paste0("Problem: Difference of means can be tested only if the variances are same.\n"))
}
## 
##  Two Sample t-test
## 
## data:  SVM by PDT
## t = 0.62152, df = 103, p-value = 0.2678
## alternative hypothesis: true difference in means between group LearnSys and group Finsys is greater than 0
## 95 percent confidence interval:
##  -0.3949429        Inf
## sample estimates:
## mean in group LearnSys   mean in group Finsys 
##               7.947887               7.711471

6.3.6 Q6

Check whether there is a difference in the average profit across the geographical locations: United Kingdom, India and the Americas.

28.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)

  • We need to conduct ANOVA because we are comparing more than 2 groups
  • aov()
    • Total Variance = Between or MSTR + Within or MSE
      • Sum Sq provides SSTR & SSE
      • Mean Sq provides MSTR (Between) & MSE (Within)
    • First Line (Column): \(\text{DOF}_{(k-1)} = 2, \text{SSTR} = 297, \text{MSTR} = 148.7\)
    • Residuals (Within) : \(\text{DOF}_{(n-k)} = 689, \text{SSE} = 68075, \text{MSE} = 98.8\)
    • (28.10) \(F = \frac{\text{MSTR}}{\text{MSE}} = 1.5\)
    • \({}^U\!P_{F = 1.5} = 0.2238\)
      • pf(q = 1.5, df1 = 2, df2 = 689, lower.tail = FALSE) \(\#\mathcal{R}\)
    • Compare with \({\alpha} = 0.05\)
      • \({}^U\!P_{(F)} > {\alpha} \to {H_0}\) cannot be rejected
      • The sample results do not provide sufficient evidence to conclude that the average profit differs across the 3 geographical locations.
  • Question: The sample sizes of these 3 groups are different (UK 553, Americas 104, India 35). Would this impact our analysis
    • If the sample sizes are similar, the power will be higher; otherwise, ANOVA is not very sensitive to unequal sample sizes. We can safely use it.
  • Warning:
    • “Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, …)”
    • remove ‘type’ option from aov()
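
If the equal-variance assumption is in doubt, oneway.test() provides Welch's ANOVA as an alternative to aov(); the data below are simulated with the same group sizes as the case study, not the real values.

```r
# Sketch: Welch's ANOVA via oneway.test(); data are simulated, not from the case
set.seed(1)
dd <- data.frame(
  grp = rep(c("UK", "Americas", "India"), times = c(553, 104, 35)),
  pp  = c(rnorm(553, mean = 50, sd = 10),
          rnorm(104, mean = 50, sd = 10),
          rnorm(35,  mean = 53, sd = 10))
)

summary(aov(pp ~ grp, data = dd))                    # classical (pooled) ANOVA
oneway.test(pp ~ grp, data = dd, var.equal = FALSE)  # Welch's ANOVA
```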
ii <- bb %>% filter(RG %in% c("UK", "India", "Americas")) %>% select(RG, PP, SO)
str(ii)
## tibble [692 x 3] (S3: tbl_df/tbl/data.frame)
##  $ RG: Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ PP: num [1:692] 46 70 46 32 50 74 46 57 60 58 ...
##  $ SO: Factor w/ 2 levels "0","1": 2 1 2 2 1 1 1 1 1 1 ...
#
# #ANOVA
ii_aov <- aov(formula = PP ~ RG, data = ii)
#
# #
model.tables(ii_aov, type = "means")
## Tables of means
## Grand mean
##         
## 50.7052 
## 
##  RG 
##        UK Americas India
##      50.6    50.33 53.51
## rep 553.0   104.00 35.00
#
# #Summary
summary(ii_aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## RG            2    297   148.7   1.505  0.223
## Residuals   689  68075    98.8
jj <- ii %>% filter(SO == "1")
str(jj)
## tibble [339 x 3] (S3: tbl_df/tbl/data.frame)
##  $ RG: Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ PP: num [1:339] 46 46 32 42 40 49 42 50 38 56 ...
##  $ SO: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
#
# #ANOVA
jj_aov <- aov(formula = PP ~ RG, data = jj)
#
# #
model.tables(jj_aov, type = "means")
## Tables of means
## Grand mean
##          
## 46.37168 
## 
##  RG 
##         UK Americas India
##      46.23    46.36 48.65
## rep 267.00    55.00 17.00
#
# #Summary
summary(jj_aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## RG            2     93   46.75   0.532  0.588
## Residuals   336  29508   87.82

6.3.7 Q7

Check whether the sales conversions are different for different geographical locations.

  • Both (Location and Sales Conversions) are categorical variables, So ChiSq Test is required
  • \(P_{\chi^2} < {\alpha} \to {H_0}\) is rejected i.e. proportions are different
    • It can be claimed that there is an association between these variables.
    • We can conclude that the sales conversions are different for different geographical locations
  • Warning:
    • “Warning in chisq.test(): Chi-squared approximation may be incorrect”
    • Some of the observed frequencies in the table are too small, e.g. Canada
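
The warning can be traced to the expected cell counts; the toy table below is made up to mimic the sparse rows (it is not the case data).

```r
# Why the warning appears: chisq.test() warns when expected counts are small;
# the counts below are hypothetical, shaped like the Region x SO table
m <- rbind(UK = c(286, 267), Spain = c(11, 1), Canada = c(1, 2))

res <- suppressWarnings(chisq.test(m))
res$expected            # expected counts under independence
any(res$expected < 5)   # TRUE: the sparse rows trigger the warning
```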
ii <- bb %>% select(RG, SO)
str(ii)
## tibble [1,000 x 2] (S3: tbl_df/tbl/data.frame)
##  $ RG: Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 4 1 1 1 1 2 1 1 1 1 ...
##  $ SO: Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
#
round(prop.table(table(ii$RG, ii$SO), margin = 1), 3)
##               
##                    0     1
##   UK           0.517 0.483
##   Other Europe 0.582 0.418
##   Americas     0.471 0.529
##   Africa       0.409 0.591
##   India        0.514 0.486
##   Japan        0.375 0.625
##   Singapore    0.739 0.261
##   Spain        0.917 0.083
##   Canada       0.333 0.667
#
# #Chi-Sq Test
tryCatch(chisq.test(table(ii$RG, ii$SO)), 
         warning = function(w) {
           print(paste0(w))
           suppressWarnings(chisq.test(table(ii$RG, ii$SO)))
           })
## [1] "simpleWarning in chisq.test(table(ii$RG, ii$SO)): Chi-squared approximation may be incorrect\n"
## 
##  Pearson's Chi-squared test
## 
## data:  table(ii$RG, ii$SO)
## X-squared = 22.263, df = 8, p-value = 0.004452

6.3.8 Q8

  • Check whether the sales conversions depend on the sales value. Check this claim by making 3 groups of Sales Value: <6-million, [6-8], >8-million dollar

  • Both are categorical variables, So ChiSq Test is required

  • \(P_{\chi^2} > {\alpha} \to {H_0}\) cannot be rejected.

    • It cannot be claimed that there is an association between these variables.
    • The sample results do not provide sufficient evidence to conclude that the sales conversions depend on the sales value.
ii <- bb %>% select(SO, SVM)
#
# #Create 3 Groups with middle group inclusive of both 6 & 8 
ii$RSVM <- cut(ii$SVM, breaks = c(0, 5.9999, 8, 15), labels = 1:3)
#
summary(ii)
##  SO           SVM         RSVM   
##  0:519   Min.   : 1.640   1:155  
##  1:481   1st Qu.: 6.690   2:346  
##          Median : 8.000   3:499  
##          Mean   : 8.044          
##          3rd Qu.: 9.440          
##          Max.   :14.230
#
table(ii$RSVM, ii$SO)
##    
##       0   1
##   1  86  69
##   2 182 164
##   3 251 248
#
# #Chi-Sq Test
chisq.test(table(ii$RSVM, ii$SO))
## 
##  Pearson's Chi-squared test
## 
## data:  table(ii$RSVM, ii$SO)
## X-squared = 1.377, df = 2, p-value = 0.5023

Validation


7 Quiz (B15, Oct-17)

7.1 Overview

  • This covers a short quiz and a case study (Case Study: JAT)
    • Case analysis done on Nov-07 has been merged here.

7.2 Short Quiz

  1. In hypothesis testing,
    1. the smaller the Type I error, the smaller the Type II error will be
    2. the smaller the Type I error, the larger the Type II error will be
    3. Type II error will not be affected by Type I error
    4. the sum of Type I and Type II errors must equal 1
    • Answer: b
  2. What type of error occurs if you accept \({H_0}\) when, in fact, it is not true
    1. Type II
    2. Type I
    3. either Type I or Type II, depending on the level of significance
    4. either Type I or Type II, depending on whether the test is one tail or two tail
    • Answer: a
  3. If the level of significance of a hypothesis test is raised from .01 to .05, the probability of a Type II error
    1. will also increase from .01 to .05
    2. will not change
    3. will decrease
    4. will increase
    • Answer: c
  4. The sum of the values of \({\alpha}\) and \({\beta}\)
    1. always add up to 1.0
    2. always add up to 0.5
    3. is the probability of Type II error
    4. None of these alternatives is correct
    • Answer: d
  5. Following the p-value approach, the null hypothesis is rejected if
    1. p-value less than or equal to \({\alpha}\)
    2. \({\alpha}\) < p-value
    3. p-value > \({\alpha}\)
    4. p-value = 1 - \({\alpha}\)
    • Answer: a
  6. The average manufacturing work week in metropolitan Chattanooga was 40.1 hours last year. It is believed that the recession has led to a reduction in the average work week. To test the validity of this belief, which of the following hypothesis formulations is correct?
    1. \({H_0} : {\mu} < 40.1 \quad {H_a} : {\mu} \geq 40.1\)
    2. \({H_0} : {\mu} \geq 40.1 \quad {H_a} : {\mu} < 40.1\)
    3. \({H_0} : {\mu} > 40.1 \quad {H_a} : {\mu} \leq 40.1\)
    4. \({H_0} : {\mu} = 40.1 \quad {H_a} : {\mu} \neq 40.1\)
    • Answer: b
  7. Statement about a population developed for the purpose of testing is called:
    1. Hypothesis
    2. Hypothesis testing
    3. Level of significance
    4. Test-statistic
    • Answer: a
  8. The probability of rejecting the null hypothesis when it is true is called:
    1. Level of confidence
    2. Level of significance
    3. Power of the test
    4. Difficult to tell
    • Answer: b
  9. The dividing point between the region where the null hypothesis is rejected and the region where it is not rejected is said to be:
    1. Critical region
    2. Critical value
    3. Acceptance region
    4. Significant region
    • Answer: b
  10. If the critical region is located equally in both sides of the sampling distribution of test-statistic, the test is called:
    1. One tailed
    2. Two tailed
    3. Right tailed
    4. Left tailed
    • Answer: b
  11. Test of hypothesis \({H_0}\): \({\mu} \leq 50\) against \({H_a}\): \({\mu} > 50\) leads to:
    1. Left-tailed test
    2. Right-tailed test
    3. Two-tailed test
    4. Difficult to tell
    • Answer: b
  12. Test of hypothesis \({H_0}\): \({\mu} = 20\) against \({H_1}\): \({\mu} < 20\) leads to:
    1. Right one-sided test
    2. Left one-sided test
    3. Two-sided test
    4. All of the above
    • Answer: b
  13. A failed student has been promoted by an examiner; it is an example of:
    1. Type-I error
    2. Type-II error
    3. Unbiased decision
    4. Difficult to tell
    • Answer: b
  14. The probability of accepting \({H_0}\) when it is True is called:
    1. Power of the test
    2. Size of the test
    3. Level of confidence
    4. Confidence coefficient
    • Answer: d
  15. Power of a test is directly related to:
    1. Type-I error
    2. Type-II error
    3. Both (a) and (b)
    4. Neither (a) nor (b)
    • Answer: b

7.3 Case: JAT

Please import the Jayalaxmi data.

# #Object Names for each sheet
namesJ <- c("xxJdata", "xxJbela", "xxJdhar", "xxJdiseases")
# #Dimensions of these datasets
str(lapply(namesJ, function(x) {dim(eval(parse(text = x)))}))
## List of 4
##  $ : int [1:2] 123 26
##  $ : int [1:2] 24 14
##  $ : int [1:2] 22 14
##  $ : int [1:2] 6 4

7.3.2 Q1

Test the claim that disease 6 (leaf curl) information was accessed at least 60 times every month on average since October 2017 due to this disease outbreak. \(({\alpha} = 0.05)\)

NOTE: Actually the claim is “at least 60 times every week.” “Month” is a printing error.

24.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

  • \({}^U\!P_{t = 2.341} = 0.01329 \to {}^U\!P_{(t)} < {\alpha} \to {H_0}\) is rejected
    • We can conclude that the information on disease 6 was accessed at least 60 times every week on average since October 2017. The claim is correct.
    • The claim as printed was tested on monthly data, which will obviously hold if the weekly rate is at least 60.
      • \({}^U\!P_{t = 5.3771} = 0.0005 \to {}^U\!P_{(t)} < {\alpha} \to {H_0}\) is rejected
  • R Notes: For ease during subsetting, convert the Date Column to Date Type
    • as.Date(), as_date()

Code (per Week Claim)

# #Data | Rename | Change from Date Time to Date
bb <- xxJdata %>% 
  rename(Dates = "Month-Year") %>% 
  #group_by(Dates) %>% 
  #summarise(D6 = sum(D6)) %>% 
  mutate(across(Dates, as_date)) 
# 
# #Get relevant rows using filter() #xxJdata[95:123, ]
ii <- bb %>% filter(Dates >= "2017-10-01")
#
t_ii <- {mean(ii$D6) - 60} / {sd(ii$D6) / sqrt(nrow(ii))}
print(t_ii)
## [1] 2.341004
pt(t_ii, df = nrow(ii) - 1, lower.tail = FALSE)
## [1] 0.01329037
#
# #One Sample t-Test
# #Note: conf.level = 0.05 is likely a typo for 0.95 (the test uses alpha = 0.05);
# #the p-value is unaffected, but the printed interval is a 5% bound
t.test(ii$D6, mu = 60, alternative = "greater", conf.level = 0.05)
## 
##  One Sample t-test
## 
## data:  ii$D6
## t = 2.341, df = 28, p-value = 0.01329
## alternative hypothesis: true mean is greater than 60
## 5 percent confidence interval:
##  74.52782      Inf
## sample estimates:
## mean of x 
##  68.41379

Code (per Month Claim)

# #Data | Rename | Sum Months D6 | Change from Date Time to Date
bb <- xxJdata %>% 
  rename(Dates = "Month-Year") %>% 
  group_by(Dates) %>% 
  summarise(D6 = sum(D6)) %>% 
  mutate(across(Dates, as_date)) 
# 
# #There are missing months, but those months are not applicable in this question
# #Get relevant rows using filter()
ii <- bb %>% filter(Dates >= "2017-10-01")
#
t_ii <- {mean(ii$D6) - 60} / {sd(ii$D6) / sqrt(nrow(ii))}
print(t_ii)
## [1] 5.377075
pt(t_ii, df = nrow(ii) - 1, lower.tail = FALSE)
## [1] 0.000516808
#
# #One Sample t-Test
# #Note: conf.level = 0.05 is likely a typo for 0.95 (the test uses alpha = 0.05);
# #the p-value is unaffected, but the printed interval is a 5% bound
t.test(ii$D6, mu = 60, alternative = "greater", conf.level = 0.05)
## 
##  One Sample t-test
## 
## data:  ii$D6
## t = 5.3771, df = 7, p-value = 0.0005168
## alternative hypothesis: true mean is greater than 60
## 5 percent confidence interval:
##  314.2406      Inf
## sample estimates:
## mean of x 
##       248

Missing Months

str(bb)
## tibble [33 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Dates: Date[1:33], format: "2015-06-01" "2015-07-01" "2015-08-01" ...
##  $ D6   : num [1:33] 0 6 31 41 74 104 91 88 74 88 ...
summary(bb)
##      Dates                  D6       
##  Min.   :2015-06-01   Min.   :  0.0  
##  1st Qu.:2016-02-01   1st Qu.: 71.0  
##  Median :2016-10-01   Median : 91.0  
##  Mean   :2016-10-25   Mean   :130.2  
##  3rd Qu.:2017-09-01   3rd Qu.:198.0  
##  Max.   :2018-05-01   Max.   :365.0
#
# #Assuming each row is one month with no duplicates
stopifnot(identical(anyDuplicated(bb$Dates), 0L))
#
# #Create Sequence of Months
#ii <- seq(ymd("2015-6-1"), ymd("2018-5-1"), by = "months")
ii <- tibble(Dates = seq(min(bb$Dates), max(bb$Dates), by = "months"))
#
diff_len <- nrow(ii) - nrow(bb)
#
if(!identical(diff_len, 0L)) {
  cat(paste0("Number of missing months = ", diff_len,"\n"))
  #
  # #Find Values that should be in Complete Sequence but are missing in the data
  as_date(setdiff(ii$Dates, bb$Dates))
  # #OR
  ii %>% anti_join(bb, by = "Dates")
  #
  # #This does not need a separate Vector of all Months
  # #Get Months Difference using Integer Division and 
  # #Filter Rows which are not consecutive and rows above them
  bb %>% 
    mutate(diff_months = (interval(lag(Dates), Dates)) %/% months(1)) %>% 
    filter( (diff_months != 1) | lead(diff_months != 1) )
}
## Number of missing months = 3
## # A tibble: 2 x 3
##   Dates         D6 diff_months
##   <date>     <dbl>       <dbl>
## 1 2017-05-01    51           1
## 2 2017-09-01   111           4
# #Fill Missing Months 
jj <- as_tibble(merge(bb, ii, by = "Dates", all = TRUE)) 
kk <- right_join(bb, ii, by = "Dates") %>% arrange(Dates)
stopifnot(identical(jj, kk))
# #Replace NA 
ll <- kk %>% mutate(across(D6, coalesce, 0)) 

7.3.3 Q2

Test the claim that, among the app users for disease information, at least 15% of them access disease information related to disease 6. \(({\alpha} = 0.05)\)

24.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

  • One Tail t-Test Caution: Proportion Test is needed.
    • Wrong Analysis:
      • \({}^U\!P_{t} = 0.094 \to {}^U\!P_{(t)} > {\alpha} \to {H_0}\) cannot be rejected.
      • We cannot conclude that at least 15% of users access D6 related information.
    • (Aside) Why using t-test is wrong here
      • 15% is NOT a mean \(({\mu})\); it is a proportion, so using it as the Mean was wrong.
      • We are NOT looking for the percentage of D6 lookups in each row or observation. It does not matter if on one day farmers only looked for other diseases and on another day searched only for D6; we are interested in the overall share of D6 searches out of all searches for Diseases
  • Proportion Test

Determine whether the proportion of farmers searching for D6 is more than \(p_0 = 0.15\)

24.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)

  • Count of Success \(({x})\) is Searches of D6
  • \(\{n = 26830, x = 4295\} \to {\overline{p}} = {x}/{n} = 0.160082\)
  • (24.5)
    • \({}^U\!P_{({\chi}^2)} < {\alpha} \to {H_0}\) is rejected i.e. the proportions are different
    • We can conclude that the proportion of farmers searching for D6 is more than \(p_0 = 0.15\) for all searches for Diseases.
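The \({\chi}^2\) statistic reported by prop.test() in the code that follows can be reproduced by hand; a minimal base-R sketch using the counts above (x = 4295, n = 26830), including the Yates continuity correction:

```r
# #One Sample Proportion Test computed by hand
x <- 4295; n <- 26830; p0 <- 0.15
p_hat <- x / n
#
# #Chi-squared statistic with continuity correction, as prop.test() reports it
chisq <- (abs(x - n * p0) - 0.5)^2 / (n * p0 * (1 - p0))
# #chisq is about 21.311, matching the X-squared from prop.test()
#
# #Upper Tail p-value via the signed square root of the statistic
z <- sign(p_hat - p0) * sqrt(chisq)
p_upper <- pnorm(z, lower.tail = FALSE)
c(p_hat = p_hat, chisq = chisq, p_upper = p_upper)
```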

Proportion Test

# #Data | Sum Disease, Variety, Micronutrients 
aa <- xxJdata %>% 
  mutate(sumD = rowSums(across(starts_with("D"))), 
         sumV = rowSums(across(starts_with("V")))) %>% 
  rename(Dates = "Month-Year", Users = "No of users", Micro = "Micronutrient") %>% 
  mutate(SUM = rowSums(across(c(sumD, sumV, Micro))),
         DIFF = Usage - SUM) %>% 
  select(Dates, Users, Usage, SUM, DIFF, sumD, sumV, Micro, D6) %>% 
  mutate(across(Dates, as_date)) %>% 
  mutate(Fraction = D6/sumD)
#
# #Confirmed that Usage is Sum Total of Disease, Variety, Micronutrients
unique(aa$DIFF)
## [1] 0
#
# #Working Set | Exclude 1 NA i.e. where sumD is zero
bb <- aa %>% drop_na(Fraction) %>% select(Usage, sumD, D6, Fraction)
#
# #Check n (Sample Count) and x (Count of Success)
bb %>% summarise(across(c(sumD, D6), sum))
## # A tibble: 1 x 2
##    sumD    D6
##   <dbl> <dbl>
## 1 26830  4295
#
# #One Sample Proportion Test with continuity correction
bb_prop <- prop.test(x = sum(bb$D6), n = sum(bb$sumD), p = 0.15, 
                     alternative = "greater", conf.level = 0.95)
bb_prop
## 
##  1-sample proportions test with continuity correction
## 
## data:  sum(bb$D6) out of sum(bb$sumD), null probability 0.15
## X-squared = 21.311, df = 1, p-value = 1.953e-06
## alternative hypothesis: true p is greater than 0.15
## 95 percent confidence interval:
##  0.1564156 1.0000000
## sample estimates:
##        p 
## 0.160082

Proportion Test (Usage)

# #Impact if we try to evaluate proportion of D6 searches out of ALL Usage (Disease, Variety, Micro)
# #Check n (Sample Count) and x (Count of Success)
bb %>% summarise(across(c(Usage, D6), sum))
## # A tibble: 1 x 2
##   Usage    D6
##   <dbl> <dbl>
## 1 71646  4295
#
# #One Sample Proportion Test with continuity correction
# #With p-value = 1, we cannot claim that at least 15% of ALL Usage searches are for D6
prop.test(x = sum(bb$D6), n = sum(bb$Usage), p = 0.15, 
                     alternative = "greater", conf.level = 0.95)
## 
##  1-sample proportions test with continuity correction
## 
## data:  sum(bb$D6) out of sum(bb$Usage), null probability 0.15
## X-squared = 4556.2, df = 1, p-value = 1
## alternative hypothesis: true p is greater than 0.15
## 95 percent confidence interval:
##  0.05849838 1.00000000
## sample estimates:
##          p 
## 0.05994752

One Sample t-Test

# #One Sample t-Test (Wrong)
if(FALSE) {
  t.test(bb$Fraction, mu = 0.15, alternative = "greater", conf.level = 0.05)
}

7.3.4 Q3

Test the claim that the average number of users in year 2017-2018 is more than the average number of users in year 2015-2016. \(({\alpha} = 0.05)\)

25.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

  • If 1 is 2017-2018, 2 is 2015-2016 : Upper Tail Test is required
  • If 1 is 2015-2016, 2 is 2017-2018 : Lower Tail Test is required
  • Because the Variances are NOT the same, the Welch t-test is required; we cannot use the Pooled Test
  • \(({}^L\!P_{t = -7.255} = {}^U\!P_{t = 7.255}) < {\alpha} \to {H_0}\) is rejected
    • The sample results provide sufficient evidence to conclude that the average number of users has increased.
  • Also, Test the claim that app usage picked up after January 2016. (Moved from Q4)
    • \({}^U\!P_{t} < {\alpha} \to {H_0}\) is rejected
    • The sample results provide sufficient evidence to conclude that the app usage has increased after January 2016.
  • NOTE: Ideally, each missing month should have been added 4 times for the 4 weeks. “ForLater”
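The Welch statistic and Satterthwaite degrees of freedom used below can be written out explicitly; a minimal sketch on synthetic stand-in vectors (hypothetical x1, x2 — not the actual Users data), checked against t.test():

```r
set.seed(42)
x1 <- rnorm(53, mean = 170, sd = 180)  # #stand-in for 2017-2018 Users
x2 <- rnorm(73, mean = 50, sd = 55)    # #stand-in for 2015-2016 Users
#
se2_1 <- var(x1) / length(x1)
se2_2 <- var(x2) / length(x2)
#
# #Welch t statistic and Welch-Satterthwaite degrees of freedom
t_w  <- (mean(x1) - mean(x2)) / sqrt(se2_1 + se2_2)
df_w <- (se2_1 + se2_2)^2 /
        (se2_1^2 / (length(x1) - 1) + se2_2^2 / (length(x2) - 1))
p_up <- pt(t_w, df = df_w, lower.tail = FALSE)
#
# #Should match t.test() with var.equal = FALSE (the default)
tt <- t.test(x1, x2, alternative = "greater")
c(t = t_w - tt$statistic, df = df_w - tt$parameter, p = p_up - tt$p.value)
```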

Code

# #Data
bb <- xxJdata %>% 
  rename(Dates = "Month-Year", Users = "No of users") %>% 
  mutate(across(Dates, as_date)) %>% 
  select(Dates, Users, Usage)
#
# #Missing Months
ii <- tibble(Dates = seq(min(bb$Dates), max(bb$Dates), by = "months"))
jj <- right_join(bb, ii, by = "Dates") %>% arrange(Dates) %>% mutate(across(Users, coalesce, 0)) 
#
# #Create 2 Groups
jj$Year <- cut(jj$Dates, breaks = c(min(ii$Dates), as_date("2017-01-01"), Inf), 
               labels = c("2015-2016", "2017-2018"))
#
# #Verify Changes
jj[!duplicated(jj$Year), ]
## # A tibble: 2 x 4
##   Dates      Users Usage Year     
##   <date>     <dbl> <dbl> <fct>    
## 1 2015-06-01     2     4 2015-2016
## 2 2017-01-01    92   495 2017-2018
jj %>% filter(Dates %in% ymd(c("2016-12-01", "2017-01-01")))
## # A tibble: 8 x 4
##   Dates      Users Usage Year     
##   <date>     <dbl> <dbl> <fct>    
## 1 2016-12-01    50   536 2015-2016
## 2 2016-12-01    54   318 2015-2016
## 3 2016-12-01    99   558 2015-2016
## 4 2016-12-01   104   573 2015-2016
## 5 2017-01-01    92   495 2017-2018
## 6 2017-01-01   130   578 2017-2018
## 7 2017-01-01    87   436 2017-2018
## 8 2017-01-01    60   261 2017-2018
#
# #For Two Sample t-test, check if Variances are equal
jj_var <- var.test(Users ~ Year, data = jj)
jj_var
## 
##  F test to compare two variances
## 
## data:  Users by Year
## F = 0.10358, num df = 72, denom df = 52, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.06158444 0.17055359
## sample estimates:
## ratio of variances 
##          0.1035754
#
# #If Variances are Equal, Pooled Test otherwise Welch Test
isVarEqual <- ifelse(jj_var$p.value > 0.05, TRUE, FALSE)
#
# #Because 1 is "2015-2016", 2 is "2017-2018", we need to perform Lower Tail Test
jj_t <- t.test(formula = Users ~ Year, data = jj, alternative = "less", var.equal = isVarEqual)
jj_t
## 
##  Welch Two Sample t-test
## 
## data:  Users by Year
## t = -7.255, df = 59.87, p-value = 4.641e-10
## alternative hypothesis: true difference in means between group 2015-2016 and group 2017-2018 is less than 0
## 95 percent confidence interval:
##       -Inf -92.96694
## sample estimates:
## mean in group 2015-2016 mean in group 2017-2018 
##                50.06849               170.84906
#
# #Alternatively, we can reverse Factor levels to perform Upper Tail Test
kk <- jj
kk$Year <- factor(kk$Year, levels = rev(levels(jj$Year)))
#
t.test(formula = Users ~ Year, data = kk, alternative = "greater", var.equal = isVarEqual)
## 
##  Welch Two Sample t-test
## 
## data:  Users by Year
## t = 7.255, df = 59.87, p-value = 4.641e-10
## alternative hypothesis: true difference in means between group 2017-2018 and group 2015-2016 is greater than 0
## 95 percent confidence interval:
##  92.96694      Inf
## sample estimates:
## mean in group 2017-2018 mean in group 2015-2016 
##               170.84906                50.06849

4b

# #Data
str(jj)
## tibble [126 x 4] (S3: tbl_df/tbl/data.frame)
##  $ Dates: Date[1:126], format: "2015-06-01" "2015-07-01" "2015-07-01" ...
##  $ Users: num [1:126] 2 1 1 4 6 12 13 10 7 12 ...
##  $ Usage: num [1:126] 4 1 25 70 100 291 225 141 148 215 ...
##  $ Year : Factor w/ 2 levels "2015-2016","2017-2018": 1 1 1 1 1 1 1 1 1 1 ...
# #For Two Sample t-test, check if Variances are equal
jj_var <- var.test(Usage ~ Year, data = jj)
jj_var
## 
##  F test to compare two variances
## 
## data:  Usage by Year
## F = 0.8706, num df = 72, denom df = 49, p-value = 0.5853
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5117114 1.4434337
## sample estimates:
## ratio of variances 
##          0.8705962
#
# #If Variances are Equal, Pooled Test otherwise Welch Test
isVarEqual <- ifelse(jj_var$p.value > 0.05, TRUE, FALSE)
#
# #Because 1 is "2015-2016", 2 is "2017-2018", we need to perform Lower Tail Test
t.test(formula = Usage ~ Year, data = jj, alternative = "less", var.equal = isVarEqual)
## 
##  Two Sample t-test
## 
## data:  Usage by Year
## t = -4.7369, df = 121, p-value = 2.981e-06
## alternative hypothesis: true difference in means between group 2015-2016 and group 2017-2018 is less than 0
## 95 percent confidence interval:
##     -Inf -252.37
## sample estimates:
## mean in group 2015-2016 mean in group 2017-2018 
##                424.6849                812.9000

7.3.5 Q4

Check whether app usage is same or different across the four weeks of a month. Test the claim that app usage picked up after January 2016. (Answered with Q3)

NOTE: The question is ‘check whether app usage is same or different across the four weeks of a month, using Jan-2016 - May-2018 data.’ However, as seen in figure 7.1, this time period has 3 months of missing data and a completely different usage pattern after them, so I believe that testing only this data would give biased results. So, this was not done.

28.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)

  • ANOVA is needed because more than two means are to be compared
    • The data groups are not normal, nor are the variances equal
    • Log transformation failed the Normality Test; the Residual Test also failed
      • NOTE: Data shown in class is the subset “Jan-2016 - May-2018” i.e. the log transformation of that subset becomes normal.
    • However, the Sqrt transformation passed the Normality Test, and the Variances were found to be the same (by the Levene and Fligner Tests)
  • Using the Sqrt of the Data, \({}^U\!P_{(F)} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude that the app usage is different across the four weeks of a month.

Question: When ANOVA is done on transformed data and a conclusion is reached, does this imply that the original data would also follow the same conclusion?

  • Look at the p-value of the Transformed data for accepting or rejecting the Hypothesis, but look at the mean of the original data to apply those conclusions.
  • The ANOVA p-value is NOT trustworthy if the data is NOT Normal.

Question: When we are running any test, should we check whether the data is normal? Yes.
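One common pattern for that check is to fit the ANOVA first and run Shapiro-Wilk on the pooled model residuals, a single-test complement to per-group Shapiro tests; a minimal sketch on synthetic skewed data (hypothetical names dd, fit):

```r
set.seed(7)
dd <- data.frame(Week  = factor(rep(paste0("Week", 1:4), each = 30)),
                 Usage = rexp(120, rate = 1/500))  # #right-skewed, like raw Usage
#
# #Fit ANOVA, then test the pooled residuals (Data - Group Mean) for normality
fit <- aov(Usage ~ Week, data = dd)
shapiro.test(residuals(fit))$p.value  # #typically tiny for strongly skewed data
#
# #Repeat after a square-root transformation and compare
fit2 <- aov(sqrt(Usage) ~ Week, data = dd)
shapiro.test(residuals(fit2))$p.value
```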

Images

Figure 7.2 JAT: QQ Plot of Usage and Sqrt(Usage)

Anova & Kruskal

# #Data | Missing Months can be ignored because those are missing across all weeks
bb <- xxJdata %>% 
  rename(Dates = "Month-Year", Users = "No of users") %>% 
  select(Week, Usage)
#
str(bb)
## tibble [123 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Week : chr [1:123] "Week4" "Week1" "Week2" "Week3" ...
##  $ Usage: num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
summary(bb)
##      Week               Usage       
##  Length:123         Min.   :   1.0  
##  Class :character   1st Qu.: 286.5  
##  Mode  :character   Median : 450.0  
##                     Mean   : 582.5  
##                     3rd Qu.: 749.5  
##                     Max.   :3462.0
#
# #ANOVA (on original data : neither normal, nor of equal variance)
bb_aov <- aov(formula = Usage ~ Week, data = bb)
#
# #
model.tables(bb_aov, type = "means")
## Tables of means
## Grand mean
##          
## 582.4959 
## 
##  Week 
##     Week1 Week2 Week3 Week4
##     551.9 522.2   480 764.7
## rep  31.0  30.0    30  32.0
#
# #Summary
summary(bb_aov)
##              Df   Sum Sq Mean Sq F value Pr(>F)  
## Week          3  1515178  505059    2.22 0.0894 .
## Residuals   119 27074553  227517                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
bb_aov
## Call:
##    aov(formula = Usage ~ Week, data = bb)
## 
## Terms:
##                     Week Residuals
## Sum of Squares   1515178  27074553
## Deg. of Freedom        3       119
## 
## Residual standard error: 476.9877
## Estimated effects may be unbalanced
#
kruskal.test(Usage ~ Week, data = bb)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Usage by Week
## Kruskal-Wallis chi-squared = 3.1614, df = 3, p-value = 0.3674
#
# #Poisson Test (ForLater)
#anova(glm(Usage ~ Week, data = ii, family = poisson), test = "LRT")
#
# #Transformation: Square Root Data
ii <- bb %>% mutate(Week = factor(Week)) %>% mutate(Sqrt = sqrt(Usage))
str(ii)
## tibble [123 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Week : Factor w/ 4 levels "Week1","Week2",..: 4 1 2 3 4 1 2 3 4 1 ...
##  $ Usage: num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
##  $ Sqrt : num [1:123] 2 1 5 8.37 10 ...
summary(ii)
##     Week        Usage             Sqrt      
##  Week1:31   Min.   :   1.0   Min.   : 1.00  
##  Week2:30   1st Qu.: 286.5   1st Qu.:16.93  
##  Week3:30   Median : 450.0   Median :21.21  
##  Week4:32   Mean   : 582.5   Mean   :22.35  
##             3rd Qu.: 749.5   3rd Qu.:27.38  
##             Max.   :3462.0   Max.   :58.84
#
# #ANOVA
ii_aov <- aov(formula = Sqrt ~ Week, data = ii)
# #
model.tables(ii_aov, type = "means")
## Tables of means
## Grand mean
##          
## 22.35259 
## 
##  Week 
##     Week1 Week2 Week3 Week4
##     21.98 21.57 20.74 24.95
## rep 31.00 30.00 30.00 32.00
#
# #Summary
summary(ii_aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## Week          3    317  105.75   1.274  0.286
## Residuals   119   9874   82.98
#
kruskal.test(Sqrt ~ Week, data = ii)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Sqrt by Week
## Kruskal-Wallis chi-squared = 3.1614, df = 3, p-value = 0.3674

Variance

(External) Variances in R

Statistical tests for comparing the variances of two or more samples. Equal variances across samples is called homogeneity of variances.

  • Reason
    • The two independent samples t-test and the ANOVA test assume that variances are equal across groups.
  • Statistical tests for comparing variances
    • F-test: Compare the variances of two samples. The data must be normally distributed.
    • Bartlett test: Compare the variances of k samples, where k can be more than two samples.
      • The data must be normally distributed.
    • Levene test: Compare the variances of k samples, where k can be more than two samples.
      • It is an alternative to the Bartlett test that is less sensitive to departures from normality.
    • Fligner-Killeen test: a non-parametric test which is very robust against departures from normality.
  • Hypothesis
    • For all these tests (Bartlett test, Levene test or Fligner-Killeen test):
Definition 7.1 \(\text{\{Variances\}} {H_0} : {\sigma}_1^2 = {\sigma}_2^2 = \dots = {\sigma}_k^2 \iff {H_a}: \text{At least two variances differ.}\)
  • On this data, all 3 tests have p-value less than 0.05, i.e. Variances are NOT same
  • On the Transformed Data (Sqrt), Levene Test and Fligner Test fail to detect difference in Variances
# #Data
str(bb)
## tibble [123 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Week : chr [1:123] "Week4" "Week1" "Week2" "Week3" ...
##  $ Usage: num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
summary(bb)
##      Week               Usage       
##  Length:123         Min.   :   1.0  
##  Class :character   1st Qu.: 286.5  
##  Mode  :character   Median : 450.0  
##                     Mean   : 582.5  
##                     3rd Qu.: 749.5  
##                     Max.   :3462.0
#
# #Bartlett Test
bartlett.test(Usage ~ Week, data = bb)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Usage by Week
## Bartlett's K-squared = 25.89, df = 3, p-value = 1.006e-05
#
# #Levene Test
ii <- bb %>% mutate(Week = factor(Week))
leveneTest(Usage ~ Week, data = ii)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)  
## group   3  2.8691 0.0394 *
##       119                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# #Fligner-Killeen test
fligner.test(Usage ~ Week, data = bb)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  Usage by Week
## Fligner-Killeen:med chi-squared = 8.5854, df = 3, p-value = 0.03534
#
# #Transformation: Square Root Data
ii <- bb %>% mutate(Week = factor(Week)) %>% mutate(Sqrt = sqrt(Usage))
bartlett.test(Sqrt ~ Week, data = ii)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Sqrt by Week
## Bartlett's K-squared = 10.656, df = 3, p-value = 0.01374
leveneTest(Sqrt ~ Week, data = ii)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   3  2.4529 0.06668 .
##       119                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fligner.test(Sqrt ~ Week, data = ii)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  Sqrt by Week
## Fligner-Killeen:med chi-squared = 6.6688, df = 3, p-value = 0.08324

Normality

# #Do the data from each of the 4 groups follow a normal distribution?
# #Shapiro-Wilk normality test
bb %>% mutate(Week = factor(Week)) %>% 
  group_by(Week) %>% 
  summarise(N = n(), Mean = mean(Usage), SD = sd(Usage),
            p_Shapiro = shapiro.test(Usage)$p.value, 
            isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 6
##   Week      N  Mean    SD p_Shapiro isNormal
##   <fct> <int> <dbl> <dbl>     <dbl> <lgl>   
## 1 Week1    31  552.  384. 0.00631   FALSE   
## 2 Week2    30  522.  353. 0.0173    FALSE   
## 3 Week3    30  480.  333. 0.000843  FALSE   
## 4 Week4    32  765.  714. 0.0000964 FALSE
# #Check Q-Q plot
#qqnorm(bb[bb$Week == "Week1", ]$Usage)
#
# #Transformation: Log (Did not pass Normality)
bb %>% mutate(Week = factor(Week)) %>% 
  mutate(Log = log(Usage)) %>% 
  group_by(Week) %>% 
  summarise(p_Shapiro = shapiro.test(Log)$p.value, 
            isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
##   Week    p_Shapiro isNormal
##   <fct>       <dbl> <lgl>   
## 1 Week1 0.000000533 FALSE   
## 2 Week2 0.0314      FALSE   
## 3 Week3 0.689       TRUE    
## 4 Week4 0.00125     FALSE
#
# #Transformation: Square Root (Success: Passed Normality) - Selected
bb %>% mutate(Week = factor(Week)) %>% 
  mutate(Sqrt = sqrt(Usage)) %>% 
  group_by(Week) %>% 
  summarise(p_Shapiro = shapiro.test(Sqrt)$p.value, 
            isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
##   Week  p_Shapiro isNormal
##   <fct>     <dbl> <lgl>   
## 1 Week1     0.493 TRUE    
## 2 Week2     0.967 TRUE    
## 3 Week3     0.108 TRUE    
## 4 Week4     0.732 TRUE
#
# #Transformation: Cube Root (Success: Passed Normality) Just to check
bb %>% mutate(Week = factor(Week)) %>% 
  mutate(CubeRoot = Usage^(1/3)) %>% 
  group_by(Week) %>% 
  summarise(p_Shapiro = shapiro.test(CubeRoot)$p.value, 
            isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
##   Week  p_Shapiro isNormal
##   <fct>     <dbl> <lgl>   
## 1 Week1    0.0979 TRUE    
## 2 Week2    0.980  TRUE    
## 3 Week3    0.359  TRUE    
## 4 Week4    0.997  TRUE
#
# #Testing Residuals i.e. Data - Group Mean (Did not pass Normality)
bb %>% mutate(Week = factor(Week)) %>% 
  group_by(Week) %>% 
  mutate(Residuals = Usage - mean(Usage)) %>% 
  summarise(p_Shapiro = shapiro.test(Residuals)$p.value, 
            isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
##   Week  p_Shapiro isNormal
##   <fct>     <dbl> <lgl>   
## 1 Week1 0.00631   FALSE   
## 2 Week2 0.0173    FALSE   
## 3 Week3 0.000843  FALSE   
## 4 Week4 0.0000964 FALSE

QQ Plot

bb <- xxJdata %>% 
  rename(Dates = "Month-Year", Users = "No of users") %>% 
  mutate(Week = factor(Week)) %>% 
  mutate(Sqrt = sqrt(Usage)) %>% 
  select(Week, Usage, Sqrt)
#
hh <- bb
ttl_hh <- "QQ Plot of Usage"
cap_hh <- "B15P04"
#
B15 <- hh %>% { ggplot(., aes(sample = Usage, colour = Week)) +
    stat_qq() +
    stat_qq_line() +
    labs(caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B15)
rm(B15)
#
ttl_hh <- "QQ Plot of Sqrt(Usage)"
cap_hh <- "B15P05"
B15 <- hh %>% { ggplot(., aes(sample = Sqrt, colour = Week)) +
    stat_qq() +
    stat_qq_line() +
    labs(caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B15)
rm(B15)

7.3.6 Q5

25.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)

A new version of the app was released in August-2016. In which month in the given time frame, after the launch of the new version, would the mean usage pattern start to show a statistically significant shift?

  • (1: OldApp, 2: NewApp) Lower Tail Test is required
  • Because the Variances are NOT the same, the Welch t-test is required; we cannot use the Pooled Test
  • \({}^L\!P_{t} < {\alpha} \to {H_0}\) is rejected
    • The sample results provide sufficient evidence to conclude that the mean usage has increased after August-2016.

Basic

# #Data
bb <- xxJdata %>% rename(Dates = "Month-Year") %>% mutate(across(Dates, as_date)) %>% 
  select(Dates, Week, Usage)
#
# #Create 2 Groups
bb$Year <- cut(bb$Dates, breaks = c(min(bb$Dates), as_date("2016-08-01"), Inf), 
               labels = c("OldApp", "NewApp"))
#
# #For Two Sample t-test, check if Variances are equal
bb_var <- var.test(Usage ~ Year, data = bb)
bb_var
## 
##  F test to compare two variances
## 
## data:  Usage by Year
## F = 0.16799, num df = 52, denom df = 69, p-value = 4.775e-10
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1014182 0.2835174
## sample estimates:
## ratio of variances 
##          0.1679867
#
# #If Variances are Equal, Pooled Test otherwise Welch Test
isVarEqual <- ifelse(bb_var$p.value > 0.05, TRUE, FALSE)
#
# #Because 1 is "OldApp", 2 is "NewApp", we need to perform Lower Tail Test
bb_t <- t.test(formula = Usage ~ Year, data = bb, alternative = "less", var.equal = isVarEqual)
bb_t
## 
##  Welch Two Sample t-test
## 
## data:  Usage by Year
## t = -6.0938, df = 96.698, p-value = 1.124e-08
## alternative hypothesis: true difference in means between group OldApp and group NewApp is less than 0
## 95 percent confidence interval:
##       -Inf -317.4977
## sample estimates:
## mean in group OldApp mean in group NewApp 
##             334.1132             770.5571
#

rollapply()

# #Rolling Sums
old_sd <- sd(bb[bb$Year == "OldApp", ]$Usage)
old_n <- summary(bb$Year)[1]
#
ii <- bb %>% filter(Year == "NewApp", Dates <= '2017-05-01') %>% select(-Year)
#
jj <- ii %>% mutate(ID = row_number(), cSUM = cumsum(Usage), cMean = cSUM/ID) %>% 
  mutate(SD = across(Usage, ~ rollapply(., ID, sd, fill = NA, align = "right"))) %>% 
  mutate(DOF = floor({SD^2 / ID + old_sd^2 / old_n }^2 / 
                       {{SD^2 / ID}^2/{ID-1} + {old_sd^2 / old_n}^2/{old_n-1}})) %>% 
  mutate(Sigma = sqrt({SD^2 /ID + old_sd^2 /old_n}))
str(jj)
## tibble [40 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Dates: Date[1:40], format: "2016-08-01" "2016-08-01" "2016-08-01" ...
##  $ Week : chr [1:40] "Week1" "Week2" "Week3" "Week4" ...
##  $ Usage: num [1:40] 421 387 264 788 691 256 261 377 295 749 ...
##  $ ID   : int [1:40] 1 2 3 4 5 6 7 8 9 10 ...
##  $ cSUM : num [1:40] 421 808 1072 1860 2551 ...
##  $ cMean: num [1:40] 421 404 357 465 510 ...
##  $ SD   : tibble [40 x 1] (S3: tbl_df/tbl/data.frame)
##   ..$ Usage: num [1:40] NA 24 82.6 225.6 220 ...
##  $ DOF  :'data.frame':   40 obs. of  1 variable:
##   ..$ Usage: num [1:40] NA 14 3 3 4 6 7 9 11 13 ...
##  $ Sigma:'data.frame':   40 obs. of  1 variable:
##   ..$ Usage: num [1:40] NA 34.9 56.6 116.9 103 ...
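The rolling statistics above stop short of computing the actual test; a minimal base-R sketch (synthetic stand-in vectors, hypothetical names old/new — not the actual Usage data) of turning each cumulative mean/SD into a per-observation Welch t and Upper Tail p-value, so the first index where p < 0.05 marks the shift:

```r
set.seed(1)
old <- rnorm(53, mean = 334, sd = 200)  # #stand-in for OldApp Usage
new <- rnorm(40, mean = 770, sd = 450)  # #stand-in for NewApp Usage
#
old_n <- length(old); old_mean <- mean(old); old_var <- var(old)
#
# #For each prefix new[1:i], Welch t against the fixed OldApp sample
res <- t(sapply(3:length(new), function(i) {
  x <- new[1:i]
  se2_new <- var(x) / i
  se2_old <- old_var / old_n
  t_i  <- (mean(x) - old_mean) / sqrt(se2_new + se2_old)
  df_i <- (se2_new + se2_old)^2 /
          (se2_new^2 / (i - 1) + se2_old^2 / (old_n - 1))
  c(i = i, t = t_i, df = df_i, p_upper = pt(t_i, df = df_i, lower.tail = FALSE))
}))
#
# #First prefix where the Upper Tail p-value drops below 0.05
res[which(res[, "p_upper"] < 0.05)[1], ]
```

The full-sample row (i = 40) reproduces t.test(new, old, alternative = "greater") exactly.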

7.3.7 Q6

If a disease is likely to spread in particular weather condition (data given in the disease index sheet), then the access of that disease should be more in the months having suitable weather conditions. Help the analyst in coming up with a statistical test to support the claim for two districts for which the sample of weather and disease access data is provided in the data sheet. Identify the diseases for which you can support this claim. Test this claim both for temperature and relative humidity at 95% confidence.

25.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

Table 7.1: (B15T01) Q6: Diseases p-value (Upper)

  Formula    p- Var    Equal Var   p- t-test   Upper H0
  D1 ~ iD1   0.00002   FALSE       0.00084     Rejected
  D2 ~ iD2   0.00009   FALSE       0.00076     Rejected
  D3 ~ iD3   0.0002    FALSE       0.00399     Rejected
  D4 ~ iD4   0.00029   FALSE       0.06659     Not Rejected
  D5 ~ iD5   0.37691   TRUE        0.01814     Rejected
  D7 ~ iD7   0.90392   TRUE        0.00239     Rejected
  • (1: Conditions Favourable for Disease i.e. TRUE, 2: FALSE) Upper Tail Test is required
  • Testing variance and applying Welch t-test (variances not same) or Pooled Test (variances are same)
  • Assuming both districts to be part of same sample
  • Values are given in Table 7.1
    • \({}^U\!P_{t} > {\alpha} \to {H_0}\) cannot be rejected for disease 4
    • \({}^U\!P_{t} < {\alpha} \to {H_0}\) is rejected for diseases 1, 2, 3, 5, and 7
    • The sample results provide sufficient evidence to conclude that the claim is True for diseases 1, 2, 3, 5, and 7
    • An Upper Tail Test was performed in place of a Two Tail Test because it is assumed that searches for a disease would not decrease during conditions favourable to it.
    • Both districts were combined with the thinking that political boundaries may not change the behaviour of people based on geographical factors (T & RH).
      • However, a case can be made when we are interested in observing difference in pattern of behaviour by the people of different districts. In that case, we would be performing a difference of means test between the two districts.

BoxPlot

Figure 7.3 JAT: Disease Searches grouped with Favourable and Unfavourable Conditions (T and RH)

Code

# #Merge both dataframes of Two districts
aa <- bind_rows(Belagavi = xxJbela, Dharwad = xxJdhar, .id = 'source') %>% 
  rename(Dates = Months, RH = "Relative Humidity", TMP = "Temperature") %>% 
  mutate(across(Dates, as_date)) %>% mutate(source = factor(source)) %>% 
  select(-c(10:13)) %>% select(-D6)
#
# #Based on Conditional T & RH, get each disease favourable condition = TRUE
q6_bb <- aa %>% mutate(iD1 = ifelse(TMP <= 24 & TMP >= 20 & RH > 80, TRUE, FALSE), 
                    iD2 = ifelse(TMP <= 24.5 & TMP >= 21.5 & RH > 83, TRUE, FALSE), 
                    iD3 = ifelse(TMP <= 24 & TMP >= 22, TRUE, FALSE), 
                    iD4 = ifelse(TMP <= 26 & TMP >= 22 & RH > 85, TRUE, FALSE), 
                    iD5 = ifelse(TMP <= 24.5 & TMP >= 22 & RH <= 85 & RH >= 77, TRUE, FALSE), 
                    iD7 = ifelse(TMP > 25 & RH > 80, TRUE, FALSE)) %>% 
  mutate(across(starts_with("i"), factor, levels = c(TRUE, FALSE)))
bb <- q6_bb
#
# #Create all Formulae for variance and t-test
formulas <- paste0(names(bb)[3:8], " ~ ", names(bb)[11:16])
#
# #Apply formulae
output <- t(sapply(formulas, function(f) {
    test_var <- var.test(as.formula(f), data = bb)
    isVarEqual <- ifelse(test_var$p.value > 0.05, TRUE, FALSE) 
    test_t <- t.test(as.formula(f), data = bb, alternative = "greater", var.equal = isVarEqual)
    c("p- Var" = format(round(test_var$p.value, 5), scientific = FALSE), 
      "Equal Var" = ifelse(test_var$p.value > 0.05, TRUE, FALSE), 
      "p- t-test" = format(round(test_t$p.value, 5), scientific = FALSE), 
      "Upper H0" = ifelse(test_t$p.value > 0.05, "Not Rejected", "Rejected"))
}))

BoxPlot More

bb <- q6_bb
hh <- q6_bb %>% 
  rename_with(~gsub("iD", "i", .x)) %>% 
  select(starts_with(c("D", "i"))) %>% 
  select(-Dates) %>% 
  pivot_longer(everything(), names_to = c(".value", "Disease"), names_pattern = "(.)(.)") %>%
  rename(Values = "D", Favourable = "i")
#
ttl_hh <- "BoxPlot of Searches for Diseases in both districts"
cap_hh <- "B15P03"
#
B15 <- hh %>% { ggplot(data = ., mapping = aes(x = Disease, y = Values, fill = Favourable)) +
        geom_boxplot(outlier.shape = NA) +
        #stat_summary(fun = mean, geom = "point", size = 2, color = "steelblue") + 
        #scale_y_continuous(breaks = seq(0, 110, 10), limits = c(0, 110)) +
        geom_point(position = position_jitterdodge(jitter.width = 0.1), 
                   size = 1, alpha = 0.7, colour = "#21908CFF") + 
        k_gglayer_box +
        theme(
            #legend.justification = c("right", "top"),
            #legend.box.just = "right",
            #legend.margin = margin(6, 6, 6, 6),
            legend.position = c(.90, .95)
        ) +
        labs(x = "Diseases", y = "Searches per month", fill = "Favourable",
             caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B15)
rm(B15)

Pattern Match

# #rename_with() uses formula
# #Selection Helpers like starts_with() can take multiple conditions
# #pivot_longer() can return multiple groups
# #Pattern Match: Both First and Second Pattern can contain only 1 character
ii <- q6_bb %>% 
  rename_with(~gsub("iD", "i", .x)) %>% 
  select(starts_with(c("D", "i"))) %>% select(-Dates) %>% 
  pivot_longer(everything(), names_to = c(".value", "Disease"), names_pattern = "(.)(.)") %>%
  rename(Values = "D", Favourable = "i")
#
# #First Pattern can contain 1 or more characters but the Second can have only 1 character
jj <- q6_bb %>% 
  select(starts_with(c("D", "i"))) %>% select(-Dates) %>% 
  pivot_longer(everything(), names_to = c(".value", "Disease"), names_pattern = "(.*)(.)") %>% 
  rename(Values = "D", Favourable = iD)
stopifnot(identical(ii, jj))

Validation


8 Data Preprocessing (B16, Oct-24)

8.1 Overview

8.2 Packages

if(FALSE){# #WARNING: Installation may take some time.
  install.packages("mice", dependencies = TRUE)
  install.packages("car", dependencies = TRUE)
}

8.3 Data

Please import the "B16-Cars2.csv".

Cars

Table 8.1: (B16T01) Cars Data (Head)
sl.No. mpg cylinders cubicinches hp weightlbs time.to.60 year brand
1 14.0 8 350 165 4209 12 1972 US
2 31.9 4 89 71 1925 14 1980 Europe
3 17.0 8 302 140 3449 11 1971 US
4 15.0 8 400 150 3761 10 1971 US
5 30.5 4 98 63 2051 17 1978 US
6 23.0 8 350 125 3900 17 1980 US

Structure

# #Structure
str(xxB16Cars)
## spec_tbl_df [263 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ sl.No.     : num [1:263] 1 2 3 4 5 6 7 8 9 10 ...
##  $ mpg        : num [1:263] 14 31.9 17 15 30.5 23 13 14 25.4 37.7 ...
##  $ cylinders  : num [1:263] 8 4 8 8 4 8 8 8 5 4 ...
##  $ cubicinches: num [1:263] 350 89 302 400 98 350 351 440 183 89 ...
##  $ hp         : num [1:263] 165 71 140 150 63 125 158 215 77 62 ...
##  $ weightlbs  : num [1:263] 4209 1925 3449 3761 2051 ...
##  $ time.to.60 : num [1:263] 12 14 11 10 17 17 13 9 20 17 ...
##  $ year       : num [1:263] 1972 1980 1971 1971 1978 ...
##  $ brand      : chr [1:263] "US" "Europe" "US" "US" ...

Summary

# #Summary
summary(xxB16Cars)
##      sl.No.           mpg           cylinders      cubicinches          hp          weightlbs     
##  Min.   :  1.0   Min.   : 10.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   : 192.5  
##  1st Qu.: 66.5   1st Qu.: 16.95   1st Qu.:4.000   1st Qu.:103.0   1st Qu.: 75.5   1st Qu.:2245.5  
##  Median :132.0   Median : 22.00   Median :6.000   Median :156.0   Median : 95.0   Median :2830.0  
##  Mean   :132.0   Mean   : 25.07   Mean   :5.593   Mean   :201.5   Mean   :106.3   Mean   :2992.9  
##  3rd Qu.:197.5   3rd Qu.: 28.90   3rd Qu.:8.000   3rd Qu.:302.0   3rd Qu.:137.5   3rd Qu.:3654.5  
##  Max.   :263.0   Max.   :527.00   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :4997.0  
##    time.to.60         year         brand          
##  Min.   : 8.00   Min.   :1971   Length:263        
##  1st Qu.:14.00   1st Qu.:1974   Class :character  
##  Median :16.00   Median :1977   Mode  :character  
##  Mean   :15.54   Mean   :1977                     
##  3rd Qu.:17.00   3rd Qu.:1980                     
##  Max.   :25.00   Max.   :1983

8.4 Missing Data

8.5 Imputation

Definition 8.1 Imputation is the process of replacing missing data with substituted values. Imputation preserves all cases by replacing missing data with an estimated value based on other available information.
  • Imputation
    • What would be the most likely value for this missing value, given all the other attributes for a particular record
    • “Multivariate Imputation by Chained Equations” (MICE) is used for Imputation.
  • A common method of “handling” missing values is simply to omit the records or fields with missing values from the analysis. However, there are issues with this approach.
    • Sometimes it is not feasible or desirable to delete all the rows containing missing values
    • Sometimes the data is deliberately missing
      • Example: Customers with high value transactions have higher missing income data
        • This is also a pattern and we cannot simply remove these customers from our analysis
    • Question: What if there are only a small number of missing values in a large dataset
      • Even then, we need to consider whether the data is meaningful. Some data points are so critical that they cannot be removed. If you delete them, you might be losing highly relevant information.
    • Question: Only 15 rows in a 1-lakh dataset
      • Do the rows belong to a new product which is relevant to our analysis
      • While these rows may not have mileage information of a car, other columns would have pricing and other critical information. Deleting these rows may result in loss of this important data.
      • It would be a waste to omit the information in all the other fields, just because one field value is missing.
  • Mean value replacement
    • Refer Mean
    • Replace the missing value by ‘mean’ of the data
    • However, the ‘mean’ is affected by presence of extreme values (outliers)
  • Median replacement
    • Refer Median
    • Although the mean is the more commonly used measure of central location, whenever a data set contains extreme values, the median is preferred.
    • The median is generally a more robust replacement than the mean, especially in the presence of extreme values
  • Mode replacement
    • Refer Mode
    • For Categorical variable, mode is preferred.
    • However, using the mode for NA replacement inflates the frequency of the most frequent item.
  • Random value replacement
    • Replace NA with a random value from the observed distribution of the variable.
    • However, the resulting observation (row) might not make sense in terms of grouping of variables (columns).
  • Problems:
    • These simple approaches usually introduce bias into the data
      • Ex: Applying mean substitution leaves the mean unchanged (desirable) but decreases variance (undesirable). The resulting confidence levels for statistical inference will be overoptimistic, as measures of spread will be artificially reduced.
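The bias from mean substitution can be seen directly on a small hypothetical vector (the values below are illustrative, not from the Cars data):

```r
# #Mean substitution keeps the mean unchanged but shrinks the variance
x <- c(10, 12, 14, NA, 16, 18, NA, 20)  #Hypothetical data with 2 NA
m <- mean(x, na.rm = TRUE)              #15
x_imp <- ifelse(is.na(x), m, x)         #Replace NA by the mean
mean(x_imp)
## [1] 15
var(x, na.rm = TRUE)
## [1] 14
var(x_imp)                              #Artificially reduced
## [1] 10
```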

8.5.1 Introduce NA

aa <- xxB16Cars #No missing value
bb <- aa     #Will have missing value later
#
# #Identify the Number of Missing Values
if(anyNA(bb)) {
  cat(paste0("NA are Present! Total NA = ", sum(is.na(bb)), "\n")) 
  } else cat(paste0("NA not found.\n"))
## NA not found.
#
# #Record Some Values, before deleting them
bb_22 <- bb$mpg[2]        #bb[2, 2] #31.9
bb_39 <- bb$brand[3]      #bb[3, 9] #"US"
bb_43 <- bb$cylinders[4]  #bb[4, 3] #8
#
# #Delete 
bb$mpg[2] <- bb$brand[3] <- bb$cylinders[4] <- NA
#
# #Identify the Number of Missing Values
cat(paste0("NA are Present! Total NA = ", sum(is.na(bb)), "\n")) 
## NA are Present! Total NA = 3
#
# #Which Columns have NA
#summary(bb)
bb_na_col <- colSums(is.na(bb))
#
# #Column Names with their Column Index
which(bb_na_col != 0)
##       mpg cylinders     brand 
##         2         3         9
#
# #Number of NA in each Column
bb_na_col[which(bb_na_col != 0)]
##       mpg cylinders     brand 
##         1         1         1
#
# #How many rows contain NA
sum(!complete.cases(bb))
## [1] 3
#
# #Indices of Rows with NA
head(which(!complete.cases(bb)))
## [1] 2 3 4
#

8.5.2 summary() NA

  • Caution:
    • summary() does not identify NA in character column but shows them in factor
    • table() does not identify NA by default but shows them by ‘useNA’
ii <- bb
summary(ii$brand)
##    Length     Class      Mode 
##       263 character character
summary(factor(ii$brand))
## Europe  Japan     US   NA's 
##     48     51    163      1
#
# #table() by default does not show NA even in factor. However it has 'useNA' option
table(ii$brand)
## 
## Europe  Japan     US 
##     48     51    163
table(factor(ii$brand))
## 
## Europe  Japan     US 
##     48     51    163
table(ii$brand, useNA = "always")
## 
## Europe  Japan     US   <NA> 
##     48     51    163      1

8.5.3 Mean and Median Replacement

  • ‘na.rm = TRUE’: All NA values will be ignored in the calculation.
ii <- bb
# #Mean Replacement
#ii %>% mutate(across(mpg, ~ replace(., is.na(.), round(mean(mpg, na.rm = TRUE), 2))))
ii$mpg[which(is.na(ii$mpg))] <- round(mean(ii$mpg, na.rm = TRUE), digits = 2)
jj <- ii
#
# #Median Replacement
ii <- bb
#ii %>% mutate(across(mpg, replace_na, round(median(mpg, na.rm = TRUE), 2)))
ii$mpg[which(is.na(ii$mpg))] <- round(median(ii$mpg, na.rm = TRUE), digits = 2)

8.5.4 Mode Replacement

table(bb$brand, useNA = "always")
## 
## Europe  Japan     US   <NA> 
##     48     51    163      1
bb %>% group_by(brand) %>% summarise(n()) 
## # A tibble: 4 x 2
##   brand  `n()`
##   <chr>  <int>
## 1 Europe    48
## 2 Japan     51
## 3 US       163
## 4 <NA>       1
#
# #Mode Replacement
#ii$brand[which(is.na(ii$brand))] <- f_getMode(ii$brand)
ii$brand[which(is.na(ii$brand))] <- "US"
#
# #Caution: Do not use max() on "character" for mode replacement
# #It will only look for ASCII value of letters
ii <- c("a", "z", "c", "b", "b", "b", "USA", NA, "a")
max(ii, na.rm = TRUE) #Wrong Value
## [1] "z"
f_getMode(ii) 
## [1] "b"
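f_getMode() is a helper defined elsewhere in these notes; a minimal sketch of such a mode function (the name f_getMode_sketch is ours) could be:

```r
# #Minimal sketch of a mode helper (f_getMode() itself is defined elsewhere)
f_getMode_sketch <- function(x) {
  x <- x[!is.na(x)]                       #Ignore NA
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]   #Most frequent value
}
f_getMode_sketch(c("a", "z", "c", "b", "b", "b", "USA", NA, "a"))
## [1] "b"
```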

8.5.5 md.pattern()

  • mice::md.pattern()
    • It gives the number of NA in each column and identifies the rows that have NA
    • This is useful when there is correlation in the missing values of two or more columns
# #Convert to Factor before using MICE
bb$brand <- factor(bb$brand)
#
# #mice::md.pattern() 
na_bb <- md.pattern(bb, plot = FALSE)
na_bb
##     sl.No. cubicinches hp weightlbs time.to.60 year mpg cylinders brand  
## 260      1           1  1         1          1    1   1         1     1 0
## 1        1           1  1         1          1    1   1         1     0 1
## 1        1           1  1         1          1    1   1         0     1 1
## 1        1           1  1         1          1    1   0         1     1 1
##          0           0  0         0          0    0   1         1     1 3

Figure 8.1 Cars: Inserted Missing Value Pattern by md.pattern()

8.5.6 Seed

  • set.seed()
    • Random Number Generation can be fixed by a Seed for Reproducibility or Replication
    • The number given as seed ‘3’ is not meaningful. It can be anything.
    • However, it is recommended to use the same seed throughout the calculations to avoid the perception of fixing the values.
# #Choose Two Numbers from 1:10, Randomly
sample(1:10, 2)
## [1] 7 2
sample(1:10, 2)
## [1] 7 4
sample(1:10, 2)
## [1]  3 10
# #All above calls to generate Two random numbers produce different outcomes
# #Using set.seed() we can regenerate same random numbers everytime
set.seed(3)
sample(1:10, 2)
## [1] 5 7
sample(1:10, 2)
## [1] 4 8
#
set.seed(3)
sample(1:10, 2)
## [1] 5 7
sample(1:10, 2)
## [1] 4 8

8.5.7 mice()

MICE

  • package:mice is used for imputation
    • “Multivariate Imputation by Chained Equations”
      • It can impute a factor vector based on numeric vectors (and vice versa)
    • Question: What happens if any categorical variable is associated with multiple numeric values. For example, if Car Honda has multiple mileage values
      • It will look at all other columns of data and based on these multiple columns identify a pattern which will be used for imputation.
    • Advantage:
      • During the mean replacement, we are using only one column for imputation. MICE is more robust because it looks for pattern in multiple columns
    • m = 2 is the number of imputed sets. They appear as imputation columns 1 and 2 in the output. It does not mean that only 2 columns have NA.
    • iter from 1 to 5 is the number of iterations
    • Caution: Check that the column is a “factor.” A “character” column will not be imputed and will retain its NA.
    • Question: For a categorical variable does it always give ‘mode’
      • NO, it looks at the pattern based on other variables
# #Convert to Factor before using MICE
bb$brand <- factor(bb$brand)

# #mice() for imputation
# #Including all relevant data i.e. skipping Serial Number only
impute <- mice(bb[, 2:9], m = 2, seed = 3)
## 
##  iter imp variable
##   1   1  mpg  cylinders  brand
##   1   2  mpg  cylinders  brand
##   2   1  mpg  cylinders  brand
##   2   2  mpg  cylinders  brand
##   3   1  mpg  cylinders  brand
##   3   2  mpg  cylinders  brand
##   4   1  mpg  cylinders  brand
##   4   2  mpg  cylinders  brand
##   5   1  mpg  cylinders  brand
##   5   2  mpg  cylinders  brand
#
print(impute)
## Class: mids
## Number of multiple imputations:  2 
## Imputation methods:
##         mpg   cylinders cubicinches          hp   weightlbs  time.to.60        year       brand 
##       "pmm"       "pmm"          ""          ""          ""          ""          ""   "polyreg" 
## PredictorMatrix:
##             mpg cylinders cubicinches hp weightlbs time.to.60 year brand
## mpg           0         1           1  1         1          1    1     1
## cylinders     1         0           1  1         1          1    1     1
## cubicinches   1         1           0  1         1          1    1     1
## hp            1         1           1  0         1          1    1     1
## weightlbs     1         1           1  1         0          1    1     1
## time.to.60    1         1           1  1         1          0    1     1
#
# #For each iteration we have a different set of imputed data 
# #e.g. for 'mpg' in two sets values are
impute$imp$mpg
##    1  2
## 2 28 14
#
# #NOTE: Original Values that were removed earlier 
tibble(mpg = bb_22, brand = bb_39, cylinders = bb_43)
## # A tibble: 1 x 3
##     mpg brand cylinders
##   <dbl> <chr>     <dbl>
## 1  31.9 US            8
#
# #Complete First Set
set1_bb <- complete(impute, 1)
tibble(mpg = set1_bb$mpg[2], brand = set1_bb$brand[3], cylinders = set1_bb$cylinders[4])
## # A tibble: 1 x 3
##     mpg brand cylinders
##   <dbl> <fct>     <dbl>
## 1    28 US            8
#
# #Complete Second Set
set2_bb <- complete(impute, 2)
tibble(mpg = set2_bb$mpg[2], brand = set2_bb$brand[3], cylinders = set2_bb$cylinders[4])
## # A tibble: 1 x 3
##     mpg brand cylinders
##   <dbl> <fct>     <dbl>
## 1    14 US            8

MICE More

set.seed(3)
ii <- mice(bb[, 2:9], m = 3)
## 
##  iter imp variable
##   1   1  mpg  cylinders  brand
##   1   2  mpg  cylinders  brand
##   1   3  mpg  cylinders  brand
##   2   1  mpg  cylinders  brand
##   2   2  mpg  cylinders  brand
##   2   3  mpg  cylinders  brand
##   3   1  mpg  cylinders  brand
##   3   2  mpg  cylinders  brand
##   3   3  mpg  cylinders  brand
##   4   1  mpg  cylinders  brand
##   4   2  mpg  cylinders  brand
##   4   3  mpg  cylinders  brand
##   5   1  mpg  cylinders  brand
##   5   2  mpg  cylinders  brand
##   5   3  mpg  cylinders  brand
set.seed(3)
jj <- mice(bb[, 2:9], m = 3)
## 
##  iter imp variable
##   1   1  mpg  cylinders  brand
##   1   2  mpg  cylinders  brand
##   1   3  mpg  cylinders  brand
##   2   1  mpg  cylinders  brand
##   2   2  mpg  cylinders  brand
##   2   3  mpg  cylinders  brand
##   3   1  mpg  cylinders  brand
##   3   2  mpg  cylinders  brand
##   3   3  mpg  cylinders  brand
##   4   1  mpg  cylinders  brand
##   4   2  mpg  cylinders  brand
##   4   3  mpg  cylinders  brand
##   5   1  mpg  cylinders  brand
##   5   2  mpg  cylinders  brand
##   5   3  mpg  cylinders  brand
#
# #identical() is FALSE but all.equal() is TRUE
identical(ii, jj)
## [1] FALSE
all.equal(ii, jj)
## [1] TRUE
#
# #Similarly
ii <- mice(bb[, 2:9], m = 3, seed = 3)
## 
##  iter imp variable
##   1   1  mpg  cylinders  brand
##   1   2  mpg  cylinders  brand
##   1   3  mpg  cylinders  brand
##   2   1  mpg  cylinders  brand
##   2   2  mpg  cylinders  brand
##   2   3  mpg  cylinders  brand
##   3   1  mpg  cylinders  brand
##   3   2  mpg  cylinders  brand
##   3   3  mpg  cylinders  brand
##   4   1  mpg  cylinders  brand
##   4   2  mpg  cylinders  brand
##   4   3  mpg  cylinders  brand
##   5   1  mpg  cylinders  brand
##   5   2  mpg  cylinders  brand
##   5   3  mpg  cylinders  brand
jj <- mice(bb[, 2:9], m = 3, seed = 3)
## 
##  iter imp variable
##   1   1  mpg  cylinders  brand
##   1   2  mpg  cylinders  brand
##   1   3  mpg  cylinders  brand
##   2   1  mpg  cylinders  brand
##   2   2  mpg  cylinders  brand
##   2   3  mpg  cylinders  brand
##   3   1  mpg  cylinders  brand
##   3   2  mpg  cylinders  brand
##   3   3  mpg  cylinders  brand
##   4   1  mpg  cylinders  brand
##   4   2  mpg  cylinders  brand
##   4   3  mpg  cylinders  brand
##   5   1  mpg  cylinders  brand
##   5   2  mpg  cylinders  brand
##   5   3  mpg  cylinders  brand
identical(ii, jj)
## [1] FALSE
all.equal(ii, jj)
## [1] TRUE

Warning Logged Event

(External) MICE Package Author

  • The loggedEvents component of the ‘mids’ object is a data frame with five columns.
    • ‘it’ and ‘im’ stand for the iteration and imputation number
    • ‘dep’ contains the name of the target variable, and is left blank at initialization
    • ‘meth’ signals the type of problem
      • ‘constant’ : oversized representation of a single value
        • This also comes up if the column is “character” and not converted into “factor”
      • ‘collinear’: The column is duplicate of another column
      • ‘pmm’ : “ForLater” Unknown for now
    • ‘out’ contains the names of the removed variables.
    • In general, strive for zero entries, in which case the loggedEvents component is equal to NULL.
  • Guidance
    • Inspect all complete variables for forgotten missing data marks. Repair or remove these variables. Even one forgotten mark may ruin the imputation model. Remove outliers with improbable values.
    • Obtain insight into the strong and weak parts of the data by studying the influx-outflux pattern. Unless they are scientifically important, remove variables with low outflux, or with high fractions of missing data.
    • Perform a dry run with maxit=0 and inspect the logged events produced by mice(). Remove any constant and collinear variables before imputation.
    • Find out what will happen after the data have been imputed. Determine a set of variables that are important in subsequent analyses, and include these as predictors in all models. Transform variables to improve predictability and coherence in the complete-data model.
    • Run quickpred(), and determine values of mincor and minpuc such that the average number of predictors is around 25.
    • After imputation, determine whether the generated imputations are sensible by comparing them to the observed information, and to knowledge external to the data. Revise the model where needed.
    • Document your actions and decisions, and obtain feedback from the owner of the data.
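The dry-run advice above can be sketched as follows (assuming package:mice is attached and ‘bb’ is the data with inserted NA from this section):

```r
# #Dry run: maxit = 0 performs setup only, no imputation iterations
dry <- mice(bb[, 2:9], maxit = 0, seed = 3)
dry$loggedEvents  #NULL when no constant/collinear variables were dropped
dry$method        #Planned imputation method for each column
```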
# #Using the "character" to generate the Warning
ii <- aa
ii$mpg[2] <- ii$brand[3] <- ii$cylinders[4] <- NA
#
tryCatch(expr = {
  jj <- mice(ii[, 2:9], m = 1, seed = 3)
  }, warning = function(w) {
    print(paste0(w))
  })
## 
##  iter imp variable
##   1   1  mpg  cylinders
##   2   1  mpg  cylinders
##   3   1  mpg  cylinders
##   4   1  mpg  cylinders
##   5   1  mpg  cylinders
## [1] "simpleWarning: Number of logged events: 1\n"
#
# #Warning message: Number of logged events
# #It can occur because of variety of issues in the data
jj$loggedEvents
## NULL

8.6 Outliers

Refer Outliers: C03 and Outliers: B12

  • How do we detect and deal with outliers
    • Use Visualisations for detecting outliers

8.6.1 Histogram

Image


Figure 8.2 Cars: Histogram and Density of Weight (lbs)

hist()

# Set up the plot area to visualise 3 plots simultaneously
par(mfrow = c(1, 3))
# Create the histogram bars
hist(aa$weightlbs,
     breaks = 30,
     xlim = c(0, 5000),
     col = "blue",
     border = "black",
     ylim = c(0, 40),
     xlab = "Weight",
     ylab = "Counts",
     main = "Histogram of Car Weights")
# Make a box around the plot
box(which = "plot",
    lty = "solid",
    col = "black")

Code Histogram

# #Histogram
#bb <- na.omit(xxflights$air_time)
hh <- tibble(ee = aa$weightlbs)
ttl_hh <- "Cars: Histogram of Weight"
cap_hh <- "B16P02"
# #Basics
median_hh <- round(median(hh[[1]]), 1)
mean_hh <- round(mean(hh[[1]]), 1)
sd_hh <- round(sd(hh[[1]]), 1)
len_hh <- nrow(hh)
#
B16 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean_hh), color = '#440154FF') +
  geom_text(data = tibble(x = mean_hh, y = -Inf, 
                          label = paste0("Mean= ", mean_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
  geom_vline(aes(xintercept = median_hh), color = '#3B528BFF') +
  geom_text(data = tibble(x = median_hh, y = -Inf, 
                          label = paste0("Median= ", median_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#3B528BFF', hjust = -0.5, vjust = -0.7, angle = 90) +
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Frequency", 
       subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh, 
                         "; Median= ", median_hh, "; SD= ", sd_hh,
                         ")"), 
        caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B16)
rm(B16)

Code Density

# #Density Curve
ttl_hh <- "Cars: Density Plot of Weight"
cap_hh <- "B16P03"
# #Get Quantiles and Ranges of mean +/- sigma 
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
# #Arrow
arr_y <- 0.0005 #mean(density_hh_tbl$y) #
arr_lst <- list(list("99.7%", xpoints_hh[1], xpoints_hh[7], arr_y),
                list("95.4%", xpoints_hh[2], xpoints_hh[6], arr_y),
                list("68.3%", xpoints_hh[3], xpoints_hh[5], arr_y))
arr_hh <- arr_lst[[1]]
#
# # Latex Labels 
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'), 
                TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'), 
                TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
B16 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_density(alpha = 0.2, colour = "#21908CFF") + 
  geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  #scale_y_continuous(limits = c(0, 0.009), breaks = seq(0, 0.009, 0.003)) +
  scale_y_continuous(labels = function(n){format(n, scientific = FALSE)}) +
  scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) + 
  annotate("segment", x = xpoints_hh[4] - 0.5 * sd_hh, xend = arr_hh[[2]], y = arr_hh[[4]], 
            yend = arr_hh[[4]], arrow = arrow(type = "closed", length = unit(0.02, "npc"))) + 
  annotate("segment", x = xpoints_hh[4] + 0.5 * sd_hh, xend = arr_hh[[3]], y = arr_hh[[4]], 
            yend = arr_hh[[4]], arrow = arrow(type = "closed", length = unit(0.02, "npc"))) + 
  annotate(geom = "text", x = xpoints_hh[4], y = arr_hh[[4]], label = arr_hh[[1]]) + 
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Density", 
       subtitle = paste0("(N=", nrow(.), "; ", "Mean= ", round(mean(.[[1]]), 1), 
                         "; Median= ", round(median(.[[1]]), 1), "; SD= ", round(sd(.[[1]]), 1),
                         ")"), 
        caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B16)
rm(B16)

8.6.2 ScatterPlot

Image


Figure 8.3 Cars: Scatterplot of Weight (x) vs MPG (y) with and without the two outliers

plot()

# Create a Scatterplot
plot(aa$weightlbs,
     aa$mpg,
     xlim = c(0, 5000),
     ylim = c(0, 600),
     xlab = "Weight",
     ylab = "MPG",
     main = "Scatterplot of MPG by Weight",
     type = "p", #Points
     pch = 16,
     col = "blue")
# Add open black circles
points(aa$weightlbs,
       aa$mpg,
       type = "p",
       col = "black")

Code ScatterPlot

# #IN: hh$x, hh$y, ttl_hh, cap_hh, x_hh, y_hh
# #Define the formula for Trendline calculation
k_gg_formula <- y ~ x
#
B16 <- hh %>% { ggplot(data = ., aes(x = x, y = y)) + 
    geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
    stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")), 
                 formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
                 eq.x.rhs = "~italic(x)", parse = TRUE) +
    geom_vline(aes(xintercept = round(mean(x), 3)), color = '#440154FF', linetype = "dashed") +
    geom_hline(aes(yintercept = round(mean(y), 3)), color = '#440154FF', linetype = "dashed") +
    geom_text(data = tibble(x = mean(.[["x"]]), y = -Inf, 
                            label = TeX(r'($\bar{x}$)', output = "character")), 
              aes(x = x, y = y, label = label), 
              size = 4, color = '#440154FF', hjust = 1.5, vjust = -1, parse = TRUE ) +
    geom_text(data = tibble(x = 0, y = mean(.[["y"]]), 
                            label = TeX(r'($\bar{y}$)', output = "character")), 
              aes(x = x, y = y, label = label), 
              size = 4, color = '#440154FF', hjust = 1.5, vjust = 1.5, parse = TRUE ) +
    geom_point() +
    k_gglayer_scatter +
    labs(x = x_hh, y = y_hh,
        #subtitle = TeX(r"(Trendline Equation, $R^{2}$, $\bar{x}$ and $\bar{y}$)"), 
        caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B16)
rm(B16)

8.6.3 BoxPlot

  • Interquartile (IQR) based approach for identification of Outliers
    • Refer Percentiles for Percentile, Quartile and IQR
    • IQR = Q3 - Q1
    • Any data point not in [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] is an outlier
  • Question: Why sometimes the y-axis shows range of ~500 and sometimes ~50
    • If the 1 extreme outlier of mpg column is kept then the axis will show upto 500
    • If the point is deleted from the data, then the axis values are in range of 50

Image


Figure 8.4 Cars: BoxPlot of MPG (excluding 1 point) vs. Cylinders (4, 6, 8)

boxplot()

boxplot(mpg ~ cylinders, data = aa, xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon", main = "Mileage Data")

Code BoxPlot

# #BoxPlot
hh <- aa %>% select(mpg, cylinders) %>% filter(!cylinders %in% c(3, 5)) %>% 
  filter(mpg < max(mpg)) %>% mutate(across(cylinders, factor)) 
#
ttl_hh <- "BoxPlot of MPG (excluding 1 point) vs. Cylinders (4, 6, 8)"
cap_hh <- "B16P06"
x_hh <- "Cylinders"
y_hh <- "MPG"
#
B16 <- hh %>% { ggplot(data = ., mapping = aes(x = cylinders, y = mpg, fill = cylinders)) +
    geom_boxplot(outlier.shape = NA) +
    geom_point(position = position_jitterdodge(jitter.width = 0.1), size = 1, alpha = 0.7) + 
    k_gglayer_box +
    theme(legend.position = 'none') +
    labs(x = x_hh, y = y_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B16)
rm(B16)

8.7 Numerical Methods for detecting Outliers

8.7.1 IQR Based

  • NOTE:
    • In ‘weightlbs’ no point lies outside the IQR fences, thus no data point was eliminated
    • So the Cylinder = 6 subset was specifically selected for outlier detection and removal

All Values (No Outlier)

bb <- aa 
dim(bb)
## [1] 263   9
#
# #summary() or quantile()
summary(bb$weightlbs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   192.5  2245.5  2830.0  2992.9  3654.5  4997.0
q_bb <- quantile(bb$weightlbs, probs = c(.25, .75), na.rm = TRUE)
q_bb
##    25%    75% 
## 2245.5 3654.5
#
iqr_bb <- IQR(bb$weightlbs)
iqr_bb
## [1] 1409
#
upp_bb <- q_bb[2] + 1.5 * iqr_bb 
low_bb <- q_bb[1] - 1.5 * iqr_bb 
#
kept_bb <- bb[bb$weightlbs >= low_bb & bb$weightlbs <= upp_bb, ]
if(nrow(bb) == nrow(kept_bb)) {
  cat(paste0("No Point was removed because none was outside the range.\n"))
  } else cat(paste0("Number of Points removed = ", nrow(bb) - nrow(kept_bb),"\n"))
## No Point was removed because none was outside the range.

Cylinder = 6 (1 Outlier)

bb <- aa %>% filter(cylinders == 6)
dim(bb)
## [1] 57  9
#
# #summary() or quantile()
summary(bb$weightlbs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   192.5  2910.0  3121.0  3120.4  3415.0  3907.0
q_bb <- quantile(bb$weightlbs, probs = c(.25, .75), na.rm = TRUE)
q_bb
##  25%  75% 
## 2910 3415
#
iqr_bb <- IQR(bb$weightlbs)
iqr_bb
## [1] 505
#
upp_bb <- q_bb[2] + 1.5 * iqr_bb 
low_bb <- q_bb[1] - 1.5 * iqr_bb 
#
kept_bb <- bb[bb$weightlbs >= low_bb & bb$weightlbs <= upp_bb, ]
if(nrow(bb) == nrow(kept_bb)) {
  cat(paste0("No Point was removed because none was outside the range.\n"))
  } else cat(paste0("Number of Points removed = ", nrow(bb) - nrow(kept_bb),"\n"))
## Number of Points removed = 1

Z-score Standardisation

To be continued …

Validation


9 Data Preprocessing (B17, Oct-31)

9.1 Data

Please import the "B16-Cars2.csv".

9.2 Z-score Standardisation

18.20 The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.

  • (18.14) \(z_i = \frac{{x}_i - {\overline{x}}}{{s}}\)
    • \(z_i \notin \pm 3 \to z_i \in \text{Outlier}\)
    • After Standardisation, we can compare values of different orders because these would be scaled into a dimensionless quantity
    • While comparing the Euclidean distance between two variables like Age and Income, standardisation allows both to be scaled to a similar range
      • Ex: If the Age range is 20-30 but the Income range is 10k-30k, directly using these values will ignore any impact of a change in Age.
    • For some data mining algorithms, large differences in the ranges will lead to a tendency for the variable with greater range to have undue influence on the results. Therefore, these numeric variables should be normalised, in order to standardize the scale of effect each variable has on the results.
    • Scaling also benefits Neural networks, and algorithms that make use of distance measures, such as the k-nearest neighbors algorithm.
    • Note that Mean itself includes the impact of extreme values thus it is not very robust. IQR is better because it is based on position
    • Standardisation does not convert the non-normal data to normal. It does not change the shape of the data. Outliers remain outliers, Skewness remain in the data. It simply changes the Scale.
  • Question: Can we use different methods for outlier identification on different variables
    • Yes, you can remove outliers of one variable using IQR and of another variable using standardisation
  • Question: From the Normalised values, can we convert back to the original data
    • Tiring process
    • (Aside) The scale() function attaches the mean and sd as attributes to the output matrix. These can be used to convert the data back.
  • scale()
    • It centres and scales a vector to mean 0 and sd 1 (it does not make the distribution normal)
  • (Aside)
    • Scaling is less effective if the outliers are present. Extremely low value (e.g. 192.5 Weight) or extremely high value (527 Mileage) are obviously going to impact the scale applied. Probably, it is better to follow: Scaling | Outlier Treatment (Identification, Removal or Imputation) | Scaling of original data with Outlier Treatment
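The back-conversion mentioned above can be sketched with base scale(), which stores the centre and spread as attributes (the values below are hypothetical):

```r
# #scale() keeps 'scaled:center' and 'scaled:scale' as attributes
x <- c(10, 20, 30, 40)               #Hypothetical values
z <- scale(x)                        #Nx1 matrix
ctr <- attr(z, "scaled:center")      #mean(x) = 25
sdv <- attr(z, "scaled:scale")       #sd(x)
x_back <- as.vector(z) * sdv + ctr   #Undo the standardisation
all.equal(x_back, x)
## [1] TRUE
```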

Normalisation

# #Normalising Weight
bb <- aa %>% select(weightlbs) %>% mutate(z = as.vector(scale(weightlbs)))
str(bb)
## tibble [263 x 2] (S3: tbl_df/tbl/data.frame)
##  $ weightlbs: num [1:263] 4209 1925 3449 3761 2051 ...
##  $ z        : num [1:263] 1.402 -1.231 0.526 0.885 -1.086 ...
#
# #Excluding Outliers
kept_bb <- bb[bb$z >= -3 & bb$z <= 3, ]
str(kept_bb)
## tibble [262 x 2] (S3: tbl_df/tbl/data.frame)
##  $ weightlbs: num [1:262] 4209 1925 3449 3761 2051 ...
##  $ z        : num [1:262] 1.402 -1.231 0.526 0.885 -1.086 ...
#
# #Similarly with mpg
kept_bb <- aa %>% select(mpg) %>% mutate(z = as.vector(scale(mpg))) %>% filter(z >= -3 & z <= 3)
str(kept_bb)
## tibble [262 x 2] (S3: tbl_df/tbl/data.frame)
##  $ mpg: num [1:262] 14 31.9 17 15 30.5 23 13 14 25.4 37.7 ...
##  $ z  : num [1:262] -0.346 0.213 -0.252 -0.314 0.17 ...
summary(kept_bb)
##       mpg              z           
##  Min.   :10.00   Min.   :-0.47040  
##  1st Qu.:16.93   1st Qu.:-0.25421  
##  Median :22.00   Median :-0.09577  
##  Mean   :23.15   Mean   :-0.05981  
##  3rd Qu.:28.70   3rd Qu.: 0.11340  
##  Max.   :46.60   Max.   : 0.67222

scale()

# #scale(x, center = TRUE, scale = TRUE) output is Nx1 Matrix
bb <- aa %>% select(weightlbs) 
ii <- bb %>% mutate(z = as.vector(scale(weightlbs)))
#bb %>% mutate(z = across(weightlbs, scale)) #matrix
#bb %>% mutate(z = across(weightlbs, ~ as.vector(scale(.)))) #tibble
jj <- bb %>% mutate(across(weightlbs, list(z = ~ as.vector(scale(.))), .names = "{.fn}"))
kk <- bb
kk$z <- as.vector(scale(kk$weightlbs))
stopifnot(all(identical(ii, jj), identical(ii, kk)))

9.3 Min-Max Scaling

  • \(x^* = \frac{{x}_i - \text{min}(x)}{\text{range}(x)} = \frac{{x}_i - \text{min}(x)}{\text{max}(x) - \text{min}(x)} \to x^* \in [0, 1]\)
    • This is for scaling only, not for removal of outliers
# #Min-Max Scaling
min_aa <- min(aa$weightlbs)
max_aa <- max(aa$weightlbs)
bb <- aa %>% select(weightlbs) %>% mutate(z = {weightlbs - min_aa}/{max_aa - min_aa})
str(bb)
## tibble [263 x 2] (S3: tbl_df/tbl/data.frame)
##  $ weightlbs: num [1:263] 4209 1925 3449 3761 2051 ...
##  $ z        : num [1:263] 0.836 0.361 0.678 0.743 0.387 ...
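The same mapping is also available as scales::rescale() (the scales package is attached in the session above), which maps a vector to [0, 1] by default. A quick check against the manual formula on a stand-in vector:

```r
# #Min-Max scaling via scales::rescale() vs the manual formula
library(scales)
x <- c(1925, 2051, 3449, 3761, 4209)
z_manual <- (x - min(x)) / (max(x) - min(x))
z_pkg <- rescale(x)                     # defaults to the [0, 1] range
stopifnot(isTRUE(all.equal(z_manual, z_pkg)))
```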

9.4 Decimal Scaling

  • \(x^* = \frac{{x}_i}{10^d} \to x^* \in [-1, 1]\)
    • Where ‘d’ represents the number of digits in the largest absolute value i.e. if max(abs(x)) is 6997, d will be 4
# #Count Digits in Maximum (NOTE: Take care of NA, 0, [-1, 1] values)
d_bb <- 10^{floor(log10(max(abs(aa$weightlbs)))) + 1}
# #Decimal Scaling
bb <- aa %>% select(weightlbs) %>% mutate(z = weightlbs/d_bb)
str(bb)
## tibble [263 x 2] (S3: tbl_df/tbl/data.frame)
##  $ weightlbs: num [1:263] 4209 1925 3449 3761 2051 ...
##  $ z        : num [1:263] 0.421 0.192 0.345 0.376 0.205 ...
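The NOTE above flags NA, 0, and values already inside [-1, 1]; a small helper sketch guarding those edge cases (the function name is hypothetical):

```r
# #Decimal-scaling helper with edge-case guards (hypothetical name)
f_decScale <- function(x) {
  m <- max(abs(x), na.rm = TRUE)         # largest absolute value
  if (!is.finite(m) || m == 0) return(x) # all-NA or all-zero input: no-op
  d <- floor(log10(m)) + 1               # number of digits in m
  x / 10^max(d, 0)                       # d <= 0 when m <= 1: leave as-is
}
stopifnot(isTRUE(all.equal(f_decScale(c(6997, -42)), c(0.6997, -0.0042))),
          identical(f_decScale(c(0, 0)), c(0, 0)))
```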

9.5 Comparison

Histogram

Cars: Histogram of Weight (Original vs Scaled)

Figure 9.1 Cars: Histogram of Weight (Original vs Scaled)

hist()

par(mfrow = c(1,2))
# Create two histograms
hist(bb$weightlbs, breaks = 20,
     xlim = c(1000, 5000),
     main = "Histogram of Weight",
     xlab = "Weight",
     ylab = "Counts")
box(which = "plot",
    lty = "solid",
    col = "black")
#
hist(bb$z,
     breaks = 20,
     xlim = c(-2, 3),
     main = "Histogram of Zscore
of Weight",
     xlab = "Z-score of Weight",
     ylab = "Counts")
box(which = "plot",
    lty = "solid",
    col = "black")

9.6 Skewness

  • Refer Skewness
  • Scaling does not change the skewness
# #Skewness
bb <- aa %>% select(weightlbs) %>% mutate(z = as.vector(scale(weightlbs)))
ii <- bb$weightlbs
#
3 * {mean(ii) - median(ii)} / sd(ii)
## [1] 0.5632797
#
ii <- bb$z
3 * {mean(ii) - median(ii)} / sd(ii)
## [1] 0.5632797

9.7 Non-linear Transformations

  • It is done to convert non-normal data to normal
    • Note that scaling is a linear transformation
  • Transformations
    • Square Root - sqrt()
    • Log - log(), log10()
    • Inverse Square Root - 1/sqrt()
bb <- aa %>% select(weightlbs) %>% 
  mutate(z = as.vector(scale(weightlbs)), Sqrt = sqrt(weightlbs),
         Log = log(weightlbs), InvSqr = 1/Sqrt)
#
# #Check Skewness
vapply(bb, function(x) round(3 * {mean(x) - median(x)} / sd(x), 3), numeric(1))
## weightlbs         z      Sqrt       Log    InvSqr 
##     0.563     0.563     0.339     0.086     0.145

Histogram

Cars: Weight Transformed with Original & Scaled

Figure 9.2 Cars: Weight Transformed with Original & Scaled

facet_wrap()

# #Histogram
bb <- aa %>% select(weightlbs) %>% 
  mutate(z = as.vector(scale(weightlbs)), Sqrt = sqrt(weightlbs), 
         Log = log(weightlbs), InvSqr = 1/Sqrt) %>% 
  pivot_longer(everything(), names_to = "Key", values_to = "Values") %>% 
  mutate(across(Key, factor, levels = c("Sqrt", "Log", "InvSqr", "weightlbs", "z"), 
    labels = c("Square Root", "Natural Log", "Inverse Square", 
               "Original Weight", "Scaled Weight")))
#
hh <- bb
mean_hh <- hh %>% group_by(Key) %>% summarize(Mean = mean(Values))
#
ttl_hh <- "Cars: Weight with Transformed values and Mean"
cap_hh <- "B17P03"
#
B17 <- hh %>% { ggplot(data = ., mapping = aes(Values)) + 
    geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
    geom_vline(data = mean_hh, aes(xintercept = Mean), color = '#440154FF') +
      geom_text(data = mean_hh, aes(x = Mean, y = -Inf, label = paste0("Mean= ", f_pNum(Mean))), 
            color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
    facet_wrap(~Key, scales = 'free_x') +
    theme(plot.title.position = "panel") + 
    labs(x = "x", y = "Frequency", caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B17)
rm(B17)

f_pNum()

f_pNum <- function(x, digits = 2L) {
  # #Print Numbers
  # #round(), rounds to a number of decimal places
  # #signif() rounds to a specific number of significant places
  # #if(){} else if(){} else{} is NOT vectorised
  #ifelse(abs(x) < 0.0000001, 0*sign(x), ifelse(abs(x) > 1, round(x, digits), signif(x, digits + 1L)))
  ifelse(abs(x) < 0.0000001, 0*sign(x), floor(x) + signif(x %% 1, digits))
}

hist()

# #Histogram with Normal Distribution Overlay
par(mfrow=c(1,1))
hist(bb$InvSqr,
     breaks = 30,
     xlim=c(0.0125, 0.0275),
     col = "lightblue",
     prob = TRUE,
     border = "black",
     xlab="Inverse Square Root of Weight",
     ylab = "Counts",
     main = "Histogram of Inverse Square Root of Weight")
box(which = "plot",
    lty = "solid",
    col="black")
# #Overlay Normal density
lines(density(bb$InvSqr), col="red")

9.8 QQ Plot

Q–Q (quantile-quantile) plot is a probability plot for comparing two probability distributions by plotting their quantiles against each other. A point \((x, y)\) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate). If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the line \(y = x\).

  • Refer figure 4.6
    • The QQ Plot shows whether the values are within the specified limits of Normal Curve

Image

Cars: QQ Plots of Transformed Weight

Figure 9.3 Cars: QQ Plots of Transformed Weight

Flipped Axis

Cars: QQ Plots of Transformed Weight

Figure 9.4 Cars: QQ Plots of Transformed Weight

Code

# #QQ Plot
bb <- aa %>% select(weightlbs) %>% 
  filter(weightlbs > min(weightlbs)) %>% 
  mutate(z = as.vector(scale(weightlbs)), Sqrt = sqrt(weightlbs), 
         Log = log(weightlbs), InvSqr = 1/Sqrt) %>% 
  pivot_longer(everything(), names_to = "Key", values_to = "Values") %>% 
  mutate(across(Key, factor, levels = c("Sqrt", "Log", "InvSqr", "weightlbs", "z"), 
    labels = c("Square Root", "Natural Log", "Inverse Square", 
               "Original Weight", "Scaled Weight")))
#
hh <- bb
#hh %>% group_by(Key) %>% summarize(Max = max(Values), Min = min(Values))
max_hh <- min_hh <- hh %>% group_by(Key) %>% summarise(Values = 0)
#
# #Modify Number of Y-Axis Major Gridlines for Horizontal Comparison
max_hh$Values  <- c(100, 8.55, 0.0300, 5000, 2.35) #c(72, 8.55, 0.0255, 5000, 2.35)
min_hh$Values  <- c(20, 7.35, 0.0135, 1500, -1.65) #c(40, 7.35, 0.0135, 1500, -1.65)
#
ttl_hh <- "QQ Plots of Transformed Weight"
sub_hh <- "Excluded 1 Outlier and Modified Y-axis for alignment"
cap_hh <- "B17P04"
#
B17 <- hh %>% { ggplot(., aes(sample = Values)) +
    stat_qq() +
    stat_qq_line() +
    geom_blank(data=max_hh, aes(y = Values)) +
    geom_blank(data=min_hh, aes(y = Values)) +
    facet_wrap(~Key, scales = 'free') +
    scale_x_continuous(limits = c(-3, 3)) + 
    #coord_flip() +
    labs(caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, B17)
rm(B17)

qqnorm()

# Normal Q-Q Plot
qqnorm(bb$InvSqr,
       datax = TRUE,
       col = "red",
       ylim = c(0.01, 0.03),
       main = "Normal
Q-Q Plot of Inverse Square Root of Weight")
qqline(bb$InvSqr,
       col = "blue",
       datax = TRUE)

9.9 Shapiro

The Shapiro-Wilk test is a test of normality. It tests the null hypothesis that a sample came from a normally distributed population. \(P_{\text{shapiro}} > ({\alpha} = 0.05) \to \text{Data is Normal}\). Avoid using samples with more than 5000 observations.

9.9.1 rnorm()

  • rnorm(n, mean = 0, sd = 1)
    • Create Vector of Random Numbers with given mean and sd
set.seed(3)
ii <- rnorm(n = 100, mean = 50, sd = 5.99)
#
# #Check Normality of randomly generated Normal dataset
shapiro.test(ii)
## 
##  Shapiro-Wilk normality test
## 
## data:  ii
## W = 0.97928, p-value = 0.1167
#
# #Check Normality of Weight
ii <- aa %>% select(weightlbs) %>% 
  #filter(weightlbs > min(weightlbs)) %>%
  mutate(z = as.vector(scale(weightlbs)), Sqrt = sqrt(weightlbs), 
         Log = log(weightlbs), InvSqr = 1/Sqrt) %>% 
  pivot_longer(everything(), names_to = "Key", values_to = "Values") %>% 
  mutate(across(Key, factor, levels = c("Sqrt", "Log", "InvSqr", "weightlbs", "z"), 
    labels = c("Square Root", "Natural Log", "Inverse Square", 
               "Original Weight", "Scaled Weight")))
#
# #No Transformation was able to convert the data to Normality 
# #Even after excluding 1 outlier (Not shown here)
ii %>% group_by(Key) %>% 
  summarise(p_Shapiro = shapiro.test(Values)$p.value, 
            isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 5 x 3
##   Key             p_Shapiro isNormal
##   <fct>               <dbl> <lgl>   
## 1 Square Root      2.14e- 7 FALSE   
## 2 Natural Log      1.45e-14 FALSE   
## 3 Inverse Square   4.03e-25 FALSE   
## 4 Original Weight  6.81e- 7 FALSE   
## 5 Scaled Weight    6.81e- 7 FALSE

9.9.2 cut()

# #Continuous to Categorical (Bins)
cut_ii <- cut(aa$weightlbs, breaks = 3, dig.lab = 4, include.lowest = TRUE, ordered_result = TRUE)
levels(cut_ii)
## [1] "[187.7,1794]" "(1794,3396]"  "(3396,5002]"
#
# #ggplot2::cut_interval()
cut_jj <- cut_interval(aa$weightlbs, n = 3, dig.lab = 4, ordered_result = TRUE)
levels(cut_jj)
## [1] "[192.5,1794]" "(1794,3396]"  "(3396,4997]"
#
# #With Labels: NOTE default ordering is ascending
levels(cut(aa$weightlbs, breaks = 3, dig.lab = 4, include.lowest = TRUE, ordered_result = TRUE, 
           labels = c("low", "medium", "high")))
## [1] "low"    "medium" "high"
levels(cut_interval(aa$weightlbs, n = 3, dig.lab = 4, ordered_result = TRUE, 
                    labels = c("low", "medium", "high")))
## [1] "low"    "medium" "high"

9.10 Continuous to Categorical Groups

bb <- aa %>% select(weightlbs) %>% rename(Weight = 1)
#
# #Subsetting
# #Create Column explicitly to prevent Warning message: Unknown or uninitialised column: `ii`. 
bb$ii <- NA
bb$ii[bb$Weight >= 3000] <- 1
bb$ii[bb$Weight < 3000] <- 2
#
# #Using ifelse() or case_when()
bb <- bb %>% mutate(jj = ifelse(Weight >= 3000, 1, 2), 
                    kk = case_when(Weight >= 3000 ~ 1, Weight < 3000 ~ 2))
stopifnot(all(identical(bb$ii, bb$jj), identical(bb$ii, bb$kk)))
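The same binary grouping can also be written with cut(), shown in the previous section; a sketch on a stand-in vector:

```r
# #Same 1/2 grouping via cut() (stand-in vector)
w <- c(2500, 3100, 2999, 3000, 4200)
grp_ifelse <- ifelse(w >= 3000, 1, 2)
# #right = FALSE makes the intervals [lo, hi), so 3000 falls in [3000, Inf)
grp_cut <- as.numeric(as.character(
  cut(w, breaks = c(-Inf, 3000, Inf), right = FALSE, labels = c(2, 1))))
stopifnot(identical(grp_ifelse, grp_cut))
```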

9.11 Index

# #Create Data
set.seed(3)
bb <- tibble(x = rnorm(n = 10, mean = 5, sd = 0.55), 
             y = rnorm(n = 10, mean = 4.5, sd = 0.66))
#
# #Basic Indexing
bb$i <- 1:nrow(bb)
# #Index can be started from anywhere. However, it is not recommended.
bb$j <- 5:{nrow(bb) + 5L - 1L}
# 
# #Other Methods
bb$k <- seq_along(bb[[1]])
bb$l <- seq_len(nrow(bb))
bb$m <- seq.int(nrow(bb))
# #Note the placement of column at the beginning i.e. column index modified
bb <- cbind(n = 1:nrow(bb), bb)
bb <- rowid_to_column(bb, "o")
#
bb <- bb %>% mutate(p = row_number())
#
# #Excluding 'j', all other columns are equal. However, 'n' & 'o' modify the column index
stopifnot(all(identical(bb$i, bb$k), identical(bb$i, bb$l), identical(bb$i, bb$m), 
  identical(bb$i, bb$n), identical(bb$i, bb$o), identical(bb$i, bb$p)))

Validation


10 Unsupervised Learning (B18, Nov-07)

10.1 Overview

  • “Unsupervised Learning Algorithms”
    • “ForLater”
    • Case Analysis of JAT is Merged in notes of Case Study: JAT
    • NOTE: Discussion about Jupyter Notebook & Anaconda Navigator “15:45 - 16:05” is NOT covered because I am not working with it.
    • NOTE: Package ‘esquisse’ was not used because interactive is difficult to show in document format. “16:15 - 16:30”

10.2 Install

if(FALSE){# #WARNING: Installation may take some time.
  install.packages("esquisse", dependencies = TRUE)
}

10.3 Data: Churn

Please import the "B18-Churn.xlsx".

10.4 Q1

  • Explore whether there are missing values for any of the variables.
    • There are NO missing values

Transposed Churn

Table 10.1: (B18T02) Churn Transposed
Col_Row Row_1 Row_2 Row_3 Row_4 Row_5 Row_6
State KS OH NJ OH OK AL
Account Length 128 107 137 84 75 118
Area Code 415 415 415 408 415 510
Phone 382-4657 371-7191 358-1921 375-9999 330-6626 391-8027
Int’l Plan no no no yes yes yes
VMail Plan yes yes no no no no
VMail Message 25 26 0 0 0 0
Day Mins 265.1 161.6 243.4 299.4 166.7 223.4
Day Calls 110 123 114 71 113 98
Day Charge 45.07 27.47 41.38 50.9 28.34 37.98
Eve Mins 197.4 195.5 121.2 61.9 148.3 220.6
Eve Calls 99 103 110 88 122 101
Eve Charge 16.78 16.62 10.3 5.26 12.61 18.75
Night Mins 244.7 254.4 162.6 196.9 186.9 203.9
Night Calls 91 103 104 89 121 118
Night Charge 11.01 11.45 7.32 8.86 8.41 9.18
Intl Mins 10 13.7 12.2 6.6 10.1 6.3
Intl Calls 3 3 5 7 3 6
Intl Charge 2.7 3.7 3.29 1.78 2.73 1.7
CustServ Calls 1 1 0 2 3 0
Churn False. False. False. False. False. False.

Churn Data

Table 10.2: (B18T01) Churn
State Account Length Area Code Phone Int’l Plan VMail Plan VMail Message Day Mins Day Calls Day Charge Eve Mins Eve Calls Eve Charge Night Mins Night Calls Night Charge Intl Mins Intl Calls Intl Charge CustServ Calls Churn
KS 128 415 382-4657 no yes 25 265.1 110 45.07 197.4 99 16.78 244.7 91 11.01 10.0 3 2.70 1 False.
OH 107 415 371-7191 no yes 26 161.6 123 27.47 195.5 103 16.62 254.4 103 11.45 13.7 3 3.70 1 False.
NJ 137 415 358-1921 no no 0 243.4 114 41.38 121.2 110 10.30 162.6 104 7.32 12.2 5 3.29 0 False.
OH 84 408 375-9999 yes no 0 299.4 71 50.90 61.9 88 5.26 196.9 89 8.86 6.6 7 1.78 2 False.
OK 75 415 330-6626 yes no 0 166.7 113 28.34 148.3 122 12.61 186.9 121 8.41 10.1 3 2.73 3 False.
AL 118 510 391-8027 yes no 0 223.4 98 37.98 220.6 101 18.75 203.9 118 9.18 6.3 6 1.70 0 False.

NA

anyNA(bb)
## [1] FALSE

10.5 Q2

  • Compare the area code and state fields. Discuss any apparent abnormalities.
    • There are only 3 Area Codes and all of them belong to California State.

States Bar Chart

Churn: States Frequency

Figure 10.1 Churn: States Frequency

Exploration

# #Select | Rename 
bb <- aa %>% select(`Area Code`, State) %>% rename(Area = "Area Code") 
# #Select | Group | Frequency | Descending   
ii <- bb %>% select(State) %>% group_by(State) %>% summarise(CNT = n()) %>% arrange(desc(CNT)) %>% 
     mutate(across(State, factor, levels = rev(unique(State)), ordered = TRUE))

State & Area

ii <- bb %>% mutate(across(everything(), factor))
#
# #Unique Values
ii %>% summarise(across(everything(), ~ length(unique(.))))
## # A tibble: 1 x 2
##    Area State
##   <int> <int>
## 1     3    51
#
summary(ii)
##   Area          State     
##  408: 838   WV     : 106  
##  415:1655   MN     :  84  
##  510: 840   NY     :  83  
##             AL     :  80  
##             OH     :  78  
##             OR     :  78  
##             (Other):2824
#
str(levels(ii$Area))
##  chr [1:3] "408" "415" "510"
str(levels(ii$State))
##  chr [1:51] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN" ...

Code Bar Chart

# #Proper Sorting of Factors for Flipped Axes
hh <- ii %>% mutate(nState = as.integer(State))
# #Because CNT has duplicated values, ggplot would add them up if used on the x-axis
anyDuplicated(ii$CNT)
# #So, place it on Y-axis and then flip the axis
#
# #Set Alternate Labels as blanks on both Primary and Secondary x-axis
#x_sec <- x_prim <- as.character(hh$State)
#x_prim[1:nrow(hh) %%2 != 1] <- ""
#x_sec[1:nrow(hh) %%2 == 1] <- ""
#
# #Get Median Location
#hh %>% filter(CNT == median(CNT)) %>% mutate(as.integer(State))
median_loc_hh <- ceiling(nrow(hh)/2) 
#
cap_hh <- "B18P01"
ttl_hh <- "Churn: Frequency of States"
sub_hh <- paste0(nrow(hh), " States with Median = ",  median(hh$CNT)) #NULL
#
B18 <- hh %>% { ggplot(data = ., aes(x = nState, y = CNT)) +
    geom_bar(stat = 'identity', aes(fill = (nState %% 2 == 0))) + 
    geom_vline(aes(xintercept = median_loc_hh), color = '#440154FF') +
    scale_x_continuous( #sec.axis = sec_axis(~., breaks = 1:nrow(.), labels = rev(.$State)), 
      breaks = 1:nrow(.), guide = guide_axis(n.dodge = 2), labels = rev(.$State)) + 
    k_gglayer_bar + 
    coord_flip() +
    labs(x = "State", y = "Frequency", subtitle = sub_hh, 
         caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)

10.6 Q3

  • Use useful graphs to visually determine whether there are any outliers in the datasets (Note: check the same for all the numeric variables).
    • Histograms
    • QQ Plots
    • Box Plots

10.6.1 All Histograms

Image

Churn: All Histograms

Figure 10.2 Churn: All Histograms

Code

ii <- bb %>% 
  select(where(is.numeric)) %>% 
  select(!area_code) %>% 
  relocate(ends_with("_mins")) %>% 
  relocate(ends_with("_calls")) %>% 
  relocate(vmail_message, .after =  last_col()) %>% 
  pivot_longer(everything(), names_to = "Key", values_to = "Values") %>% 
  mutate(across(Key, ~ factor(., levels = unique(Key))))
#
str(ii)
# #Histogram
hh <- ii
ttl_hh <- "Churn: Histograms"
cap_hh <- "B18P03"
#
B18 <- hh %>% { ggplot(data = ., mapping = aes(x = Values)) + 
    geom_histogram(bins = ifelse(length(unique(.[[1]])) > 50, 50, length(unique(.[[1]]))),
                   alpha = 0.4, fill = '#FDE725FF') + 
    theme(plot.title.position = "panel", 
          strip.text.x = element_text(size = 10, colour = "white")) +
    facet_wrap(~Key, nrow = 3, scales = 'free') +
    labs(x = "x", y = NULL, caption = cap_hh, subtitle = NULL, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)

10.6.2 All QQ Plots

Image

Churn: All QQ Plots

Figure 10.3 Churn: All QQ Plots

Code

# #QQ Plots
hh <- ii
ttl_hh <- "Churn: QQ Plots"
cap_hh <- "B18P04"
#
B18 <- hh %>% { ggplot(., aes(sample = Values)) +
    stat_qq() +
    stat_qq_line() +
    facet_wrap(~Key, nrow = 3, scales = 'free_y') +
    #scale_x_continuous(limits = c(-3, 3)) + 
    #coord_flip() +
    theme(plot.title.position = "panel", 
          strip.text.x = element_text(size = 10, colour = "white")) +
    labs(x = "x", y = NULL, caption = cap_hh, subtitle = NULL, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)

10.6.3 9 Box Plots

Image

Churn: BoxPlots of Calls, Minutes, & Charges

Figure 10.4 Churn: BoxPlots of Calls, Minutes, & Charges

Code

# #BoxPlot
B18 <- hh %>% { ggplot(data = ., mapping = aes(x = Key, y = Values, fill = Key)) +
    geom_boxplot() +
    k_gglayer_box +
    theme(legend.position = 'none') +
    labs(x = NULL, y = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)

10.6.4 International Calls Box Plots

Image

Churn: BoxPlots of International Calls, Minutes, & Charges

Figure 10.5 Churn: BoxPlots of International Calls, Minutes, & Charges

Code

# #BoxPlot
B18 <- hh %>% { ggplot(data = ., mapping = aes(y = Values)) +
    geom_boxplot() +
    k_gglayer_box +
    theme(legend.position = 'none') +
    labs(x = NULL, y = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)

10.6.5 3 Box Plots

Churn: BoxPlots of Remaining Three

Figure 10.6 Churn: BoxPlots of Remaining Three

10.6.6 International Calls

Image

Churn: International Calls

Figure 10.7 Churn: International Calls

Exploration

# #Rename to Proper Names | To Lower, Replace by Underscore | Coercion 
bb <- aa %>% rename_with(make.names) %>% 
  rename_with(~ tolower(gsub(".", "_", .x, fixed = TRUE))) %>% 
  mutate(across(c(int_l_plan, vmail_plan), ~case_when(. == "yes" ~ TRUE, . == "no" ~ FALSE))) %>% 
  mutate(across(churn, ~case_when(. == "True." ~ TRUE, . == "False." ~ FALSE))) %>% 
  mutate(across(ends_with("_calls"), as.integer))
#t(bb %>% summarise(across(everything(), ~length(unique(.)))))
#str(bb)
#summary(bb)

Code

# #Histogram
hh <- tibble(ee = bb$intl_calls)
ttl_hh <- "Churn: Histogram of International Calls"
cap_hh <- "B18P02"
# #Bins
summary(hh[[1]])
bins_hh <- ifelse(length(unique(hh[[1]])) > 50, 50, length(unique(hh[[1]])))
# #Basics
median_hh <- round(median(hh[[1]]), 1)
mean_hh <- round(mean(hh[[1]]), 1)
sd_hh <- round(sd(hh[[1]]), 1)
len_hh <- nrow(hh)
#
B18 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_histogram(bins = bins_hh, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean_hh), color = '#440154FF') +
  geom_text(data = tibble(x = mean_hh, y = -Inf, 
                          label = paste0("Mean= ", mean_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
  geom_vline(aes(xintercept = median_hh), color = '#3B528BFF') +
  geom_text(data = tibble(x = median_hh, y = -Inf, 
                          label = paste0("Median= ", median_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#3B528BFF', hjust = -0.5, vjust = -0.7, angle = 90) +
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Frequency", 
       subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh, 
                         "; Median= ", median_hh, "; SD= ", sd_hh,
                         ")"), 
        caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)

10.6.7 Scale

Histograms

Churn: All Histograms (Scaled)

Figure 10.8 Churn: All Histograms (Scaled)

QQ Plots

Churn: All QQ Plots (Scaled)

Figure 10.9 Churn: All QQ Plots (Scaled)

BoxPlots

Churn: All Box Plots (Scaled)

Figure 10.10 Churn: All Box Plots (Scaled)

10.6.8 Overlaid Histograms

Churn: All Histograms

Figure 10.11 Churn: All Histograms

10.7 Q4

  • Identify the outliers, using:
  • The Z-score method \(z \notin [-3, +3] \to \text{Outlier}\)
  • The IQR method
  • Shown Above

Validation


11 Supervised Learning (B19, Nov-14)

11.1 Overview

  • “Supervised Learning Algorithm: Cluster Analysis”
    • “ForLater”

Validation


K-means (B20, Nov-21)


Hierarchical Clustering (B21, Nov-28)


PCA (B22, Dec-05)


Cluster analysis (B23, Dec-12)


Association Rule (B24, Dec-19)

11.2 Overview

  • “Association Rule Mining”
    • “ForLater”

Validation


Regression (B25, Dec-26)

11.3 Overview

  • “Supervised Learning Algorithm: Regression”
    • “ForLater”

11.4 Linear Regression

Definition 11.1 Linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables.
  • \({Y}\): Scalar response, Dependent variable, Outcome variable, Target, Predicted

  • \({X}\): Explanatory variables, Independent variables, Antecedent variables, Predictors

  • It is applied when the objective is to predict the outcome variable based on the antecedent variables

  • Predicted can be continuous, but Predictors can be either continuous or categorical

    • Ex: Salary (Y) is a function of Age, Gender, Education, Years of Experience
    • Ex: Consumption (Y) is a function of Income (X)
    • We are interested in \(Y = mX + C\), where \({m}\) is the slope of the line and \({C}\) is the y-intercept
    • Because there will be some error \({\epsilon}\), the equation is given by \(y = {\alpha} + {\beta}x + {\epsilon}\), where \({\alpha}\) is the average \(Y\) when \(X\) is zero.
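The equation above can be sketched on synthetic data (alpha = 2 and beta = 0.5 are made-up values; lm() should recover them approximately):

```r
# #Fit y = alpha + beta*x + eps on synthetic data
set.seed(3)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.3)   # alpha = 2, beta = 0.5
fit <- lm(y ~ x)
coef(fit)   # intercept near 2, slope near 0.5
stopifnot(abs(coef(fit)[["(Intercept)"]] - 2) < 0.3,
          abs(coef(fit)[["x"]] - 0.5) < 0.1)
```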

Validation


12 Linear Regression (B26, Jan-02)

12.1 Overview

  • “Machine learning using linear regression”
    • “ForLater”

At least one chunk is throwing an error, probably the vif(). But it needs to be checked.

12.2 Data: Hospital

# #Object Names for each sheet
namesXL <- c("xxB26Hdesc", "xxB26Hraw", "xxB26Hmod")
# #Dimensions of these datasets
str(lapply(namesXL, function(x) {dim(eval(parse(text = x)))}))
## List of 3
##  $ : int [1:2] 58 2
##  $ : int [1:2] 248 24
##  $ : int [1:2] 248 51

12.3 Data: KC House

12.4 Data: Car Dekho

12.5 Cars

# #
bb <- aa <- xxB26CarDekho
str(bb)
# #To keep only the First Word of NAME Column. Warnings are Present
bb %>% separate(name, c("Brand", NA), sep = " ", remove = FALSE) 
# Select Relevant | Calculate Age
ii <- bb %>% select(-1, 4:7) %>% mutate(Age = 2021 - year)
#unique(ii$owner)
#anyNA(ii)
#
jj <- dummy_cols(ii, select_columns = c("seller_type", "fuel", "transmission", "owner"), remove_first_dummy = TRUE)
#
## Calculate AGE
## Step 1: Create the training and test data
# Create Training and Test data
set.seed(3)
train_row_ii <- sample(1:nrow(ii), 0.8*nrow(ii))  # row indices for training data: 80:20
train_ii <- ii[train_row_ii, ]  # model training data
test_ii  <- ii[-train_row_ii, ]   # test data
#
train_row_jj <- sample(1:nrow(jj), 0.8*nrow(jj))  # row indices for training data: 80:20
train_jj <- jj[train_row_jj, ]  # model training data
test_jj  <- jj[-train_row_jj, ]   # test data

This is a validation process: the training set is used to train the model, and the test set is used to check model validity. If there is high model overfitting, then trying the model on other data will show discrepancies.
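An alternative split sketch using caret::createDataPartition() (caret is attached in the session above), which samples within quantile groups of the outcome so both sets see a similar distribution; shown on the built-in mtcars as a stand-in dataset:

```r
# #Stratified 80:20 split sketch on a stand-in dataset
library(caret)
set.seed(3)
idx <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
train_cc <- mtcars[idx, ]    # model training data
test_cc  <- mtcars[-idx, ]   # test data
stopifnot(nrow(train_cc) + nrow(test_cc) == nrow(mtcars))
```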

##Step 2: Fit the model on training data and predict selling_price

# Build the model on training data
# #Exclude the 14th column (fuel_Electric): the original dataset has only 1 such observation, and it went to the test set, not to the training set

which(names(train_jj) == "fuel_Electric")

#model1_ii <- lm(selling_price ~ ., data = train_ii[,-4])  # build the model
model1_jj <- lm(selling_price ~ ., data = train_jj[, -14])  # build the model
#
names(model1_jj)[which(!names(model1_jj) %in% c("residuals", "effects", "fitted.values", "qr", "model"))]
# #
# #[1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"
# #[8] "df.residual"   "contrasts"     "xlevels"       "call"          "terms"         "model"
#summary(model1_jj)
summary(model1_jj)$coefficients
#
# #We lose the stars doing this and the function needs to show 0 for very low values
f_pNum(summary(model1_jj)$coefficients)
#
#options(scipen=10)

#format(model1_jj$coefficients, scientific = FALSE)
#format(model1_jj$contrasts, scientific = FALSE)
#?options

Question: Can we tell the model that at least two observations of each dummy variable's levels should be included in the training dataset, or would that be considered a biased model? Do it for a categorical Y, not a continuous Y. If X is gender and, out of 1000 observations, 999 are male, can we do anything with gender? The data itself is telling us that the gender is overall male; look at the significance.

12.6 Psych

#model1_jj

#library(car)
# while making the model, please assure that there is no multicollinearity
# it means the IVs are not correlated in a high fashion.
#library (psych)

pairs.panels(train_jj[,-1], cex = 4)

vif(model1_jj)

#vif(model1)# vif stands for varience inflation factor, it is a measure of multicollinearity:
#if any variable carry vif greater than 5, which indicate that there is a multicollinearity issue
# #this does not work because we have dummy variables

#the linearly dependent variables
ld.vars <- attributes(alias(model1_jj)$Complete)$dimnames[[1]]

#remove the linearly dependent variables from the formula
formula.new <- as.formula(
    paste(
        paste(deparse(formula(model1_jj)), collapse = ""), 
        paste(ld.vars, collapse = " - "),
        sep = " - "
    )
)

#run the model again on the same training data
fit.new <- lm(formula.new, data = train_jj[, -14])
vif(fit.new)

Validation


13 WIP (B27, Jan-09)

13.1 Overview

  • “Machine learning using linear regression”
    • “ForLater”

13.2 WIP

13.3 Normality

Non-Normal

13.4 Multicollinearity

13.5 Transformation

Normal Now (Prof got p value less than 0.05)

13.6 Glance

Validation


14 WIP (B28, Jan-16)

14.1 Overview

14.2 WIP

Validation


15 WIP (B29, Jan-23)

15.1 Overview

15.2 WIP

Validation


16 Data and Statistics

Definitions and Exercises are from the Book (David R. Anderson 2018)

16.1 Overview

16.2 Introduction

Definition 16.1 Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.
Definition 16.2 Elements are the entities on which data are collected. (Generally ROWS)
Definition 16.3 A variable is a characteristic of interest for the elements. (Generally COLUMNS)
Definition 16.4 The set of measurements obtained for a particular element is called an observation.

Hence, the total number of data items can be determined by multiplying the number of observations by the number of variables.

Definition 16.5 Statistics is the art and science of collecting, analysing, presenting, and interpreting data.

16.3 Scales of Measurement

Data collection requires one of the following scales of measurement: nominal, ordinal, interval, or ratio.
Definition 16.6 The scale of measurement determines the amount of information contained in the data and indicates the most appropriate data summarization and statistical analyses.
Definition 16.7 When the data for a variable consist of labels or names used to identify an attribute of the element, the scale of measurement is considered a nominal scale.

For Example, Gender as Male and Female. In cases where the scale of measurement is nominal, a numerical code as well as a nonnumerical label may be used. For example, 1 denotes Male, 2 denotes Female. The scale of measurement is nominal even though the data appear as numerical values. Only Mode can be calculated.

Definition 16.8 The scale of measurement for a variable is considered an ordinal scale if the data exhibit the properties of nominal data and in addition, the order or rank of the data is meaningful.

For example, Size as small, medium, large. Along with the labels, similar to nominal data, the data can also be ranked or ordered, which makes the measurement scale ordinal. Ordinal data can also be recorded by a numerical code. Median can be calculated but not the Mean.

Definition 16.9 The scale of measurement for a variable is an interval scale if the data have all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure.

Interval data are always numerical. These can be ranked or ordered like ordinal. In addition, the differences between them are meaningful.

Definition 16.10 The scale of measurement for a variable is a ratio scale if the data have all the properties of interval data and the ratio of two values is meaningful.

Variables such as distance, height, weight, and time use the ratio scale of measurement. This scale requires that a zero value be included to indicate that nothing exists for the variable at the zero point. Mean can be calculated.

For example, consider the cost of an automobile. A zero value for the cost would indicate that the automobile has no cost and is free. In addition, if we compare the cost of 30,000 dollars for one automobile to the cost of 15,000 dollars for a second automobile, the ratio property shows that the first automobile is 30000/15000 = 2 times, or twice, the cost of the second automobile.

See Table 16.1 for more details.

16.3.1 Interval scale vs. Ratio scale

Interval scale is a measure of continuous quantitative data that has an arbitrary 0 reference point. This is contrasted with ratio-scaled data, which have a non-arbitrary 0 reference point. Ex: When we look at “profit”, we see that negative profit does make sense to us. So while the 0 for “profit” is meaningful (just like 0 °C in temperature measurements), it is arbitrary. Therefore, it is the Interval scale of measurement.

In an interval scale, you can take difference of two values. You may not be able to take ratios of two values. Ex: Temperature in Celsius. You can say that if temperatures of two places are 40 °C and 20 °C, then one is hotter than the other (taking difference). But you cannot say that first is twice as hot as the second (not allowed to take ratio).

In a ratio scale, you can take a ratio of two values. Ex: 40 kg is twice as heavy as 20 kg (taking ratios).

Also, “0” on ratio scale means the absence of that physical quantity. “0” on interval scale does not mean the same. 0 kg means the absence of weight. 0 °C does not mean absence of heat.
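
The Celsius/kilogram contrast above can be checked numerically; a minimal base-R sketch (unit conversions only, no external data):

```r
# Interval scale (Celsius): differences survive a unit change, ratios do not.
c_temp <- c(40, 20)                  # two temperatures in Celsius
f_temp <- c_temp * 9 / 5 + 32        # the same temperatures in Fahrenheit
c_temp[1] - c_temp[2]                # difference: 20 (meaningful)
c_temp[1] / c_temp[2]                # "twice as hot"? 2 ...
f_temp[1] / f_temp[2]                # ... but ~1.53 in Fahrenheit, so the ratio is not meaningful

# Ratio scale (weight): the ratio is invariant under a unit change.
kg <- c(40, 20)                      # weights in kilograms
lb <- kg * 2.20462                   # the same weights in pounds
kg[1] / kg[2]                        # 2
lb[1] / lb[2]                        # still 2
```

Any linear rescaling with a shifted zero (like Celsius to Fahrenheit) changes ratios; a pure scaling (like kg to lb) does not, which is exactly what makes the ratio scale special.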

Table 16.1: (C01V01) Interval scale Vs Ratio scale

  • Variable property
    • Interval scale: Addition and subtraction.
    • Ratio scale: Multiplication and division as well, i.e. you can calculate ratios and thus leverage numbers on the scale against 0.
  • Absolute zero point
    • Interval scale: The zero point is arbitrary. For example, temperature can go below 0 °C into negative values.
    • Ratio scale: Has an absolute zero, or character of origin. Height and weight cannot be zero or below zero.
  • Calculation
    • Interval scale: Statistically, the Arithmetic Mean is calculated. For dispersion, range and standard deviation are permitted; the coefficient of variation is not.
    • Ratio scale: Statistically, the Geometric or Harmonic mean can also be calculated. Range and coefficient of variation are permitted for measuring dispersion.
  • Measurement
    • Interval scale: Can measure size and magnitude as multiple factors of a defined unit.
    • Ratio scale: Can measure size and magnitude as a factor of one defined unit in terms of another.
  • Example
    • Interval scale: Temperature in Celsius, calendar years and time, Profit.
    • Ratio scale: Quantities possessing an absolute zero, like age, weight, height, or Sales.

16.4 Categorical and Quantitative Data

Definition 16.11 Data that can be grouped by specific categories are referred to as categorical data. Categorical data use either the nominal or ordinal scale of measurement.
Definition 16.12 Data that use numeric values to indicate ‘how much’ or ‘how many’ are referred to as quantitative data. Quantitative data are obtained using either the interval or ratio scale of measurement.

If the variable is categorical, the statistical analysis is limited. We can summarize categorical data by counting the number of observations in each category or by computing the proportion of the observations in each category. However, even when the categorical data are identified by a numerical code, arithmetic operations do not provide meaningful results.

Arithmetic operations provide meaningful results for quantitative variables. For example, quantitative data may be added and then divided by the number of observations to compute the average value.

Quantitative data may be discrete or continuous.

Definition 16.13 Quantitative data that measure ‘how many’ are discrete.
Definition 16.14 Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.
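
A small illustration of the distinction, with assumed toy values:

```r
# Discrete: "how many" -- counts take separate, integer values.
n_service_calls <- c(3L, 0L, 5L, 2L)      # e.g. calls received per day (assumed data)
# Continuous: "how much" -- any value in an interval is possible.
weights_kg <- c(61.2, 58.97, 70.005)      # e.g. measured weights (assumed data)
is.integer(n_service_calls)               # TRUE
```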

16.5 Cross-Sectional and Time Series Data

Definition 16.15 Cross-sectional data are data collected at the same or approximately the same point in time.
Definition 16.16 Time-series data are data collected over several time periods.

16.6 Observational Study and Experiment

Definition 16.17 In an observational study we simply observe what is happening in a particular situation, record data on one or more variables of interest, and conduct a statistical analysis of the resulting data.
Definition 16.18 The key difference between an observational study and an experiment is that an experiment is conducted under controlled conditions.

As a result, the data obtained from a well-designed experiment can often provide more information as compared to the data obtained from existing sources or by conducting an observational study.

16.7 Caution

  1. Time and Cost - The cost of data acquisition and the subsequent statistical analysis should not exceed the savings generated by using the information to make a better decision.
  2. Data Acquisition Errors - An error in data acquisition occurs whenever the data value obtained is not equal to the true or actual value that would be obtained with a correct procedure. Ex: recording error, misinterpretation etc. Blindly using any data that happen to be available or using data that were acquired with little care can result in misleading information and bad decisions.

16.8 Descriptive Statistics

Definition 16.19 Most of the statistical information is summarized and presented in a form that is easy to understand. Such summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics.

16.9 Population and Sample

Definition 16.20 A population is the set of all elements of interest in a particular study.
Definition 16.21 A sample is a subset of the population.
Definition 16.22 The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.

Refer to Sample for more …

16.10 Difference between a population and a sample

The population is the set of entities under study.

  • For example, the mean height of men. (Population “men,” parameter of interest “height”)
    • We choose the population that we wish to study.
    • Typically it is impossible to survey/measure the entire population because not all members are observable.
    • If it is possible to enumerate the entire population it is often costly to do so and would take a great deal of time.

Instead, we could take a subset of this population called a sample and use this sample to draw inferences about the population under study, given some conditions.

  • It is an inference because there will be some uncertainty and inaccuracy involved in drawing conclusions about the population based upon a sample.
    • In Simple Random Sampling (SRS) each member of the population has an equal probability of being included in the sample, hence the term “random.” There are many other sampling methods e.g. stratified sampling, cluster sampling, etc.
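
The height example above can be sketched with sample(), which draws an SRS from a finite population; the population values and seed below are assumed for illustration:

```r
set.seed(42)                                     # assumed seed, for reproducibility
population <- rnorm(100000, mean = 175, sd = 7)  # hypothetical heights (cm) of all "men"
mu   <- mean(population)                         # population parameter
srs  <- sample(population, size = 50)            # SRS: each member equally likely
xbar <- mean(srs)                                # sample statistic, used to infer mu
c(mu = mu, xbar = xbar)                          # xbar is close to mu, but not equal
```

The gap between xbar and mu is exactly the uncertainty that statistical inference quantifies.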

16.11 Statistical Inference

Definition 16.23 The process of conducting a survey to collect data for the entire population is called a census.
Definition 16.24 The process of conducting a survey to collect data for a sample is called a sample survey.
Definition 16.25 Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.

Whenever statisticians use a sample to estimate a population characteristic of interest, they usually provide a statement of the quality, or precision, associated with the estimate.

Inferential statistics are used for Hypothesis Testing.

  • It is often used to compare the differences between the treatment groups.
  • It uses measurements from the sample of subjects in the experiment to compare the treatment groups and make generalizations about the larger population of subjects.
  • Most inferential statistics are based on the principle that a test-statistic value is calculated on the basis of a particular formula.
    • That value along with the degrees of freedom, and the rejection criteria are used to determine whether differences exist between the treatment groups.
    • The larger the sample size, the more likely a statistic is to indicate that differences exist between the treatment groups.

The two most common types of Statistical Inference are -

  1. Confidence Intervals
    • To estimate a population parameter
  2. Test of Significance
    • To assess the evidence provided by data about some claim concerning a population
    • i.e. To compare observed data with a claim (Hypothesis)
    • The results of a significance test are expressed in terms of a probability that measures how well the data and the claim agree

Reasoning for Tests of Significance

  • Example: Is the sample mean \({\overline{x}}\) significantly different from the population mean \({\mu}\)?
  • To determine if two numbers are significantly different, a statistical test must be conducted to provide evidence
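
A minimal sketch of such a test with t.test() on simulated data (the sample, seed, and hypothesized mean \({\mu}\) = 50 are assumptions):

```r
set.seed(1)                          # assumed seed
x  <- rnorm(30, mean = 52, sd = 10)  # assumed sample of n = 30
tt <- t.test(x, mu = 50)             # H0: population mean equals 50
tt$statistic                         # the test statistic (t value)
tt$p.value                           # compare with the significance level (e.g. 0.05)
tt$conf.int                          # 95% confidence interval for the population mean
```

This shows both common types of inference at once: the p-value addresses the test of significance, and conf.int is the confidence interval for the parameter.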

16.12 Analytics

Definition 16.26 Analytics is the scientific process of transforming data into insight for making better decisions.

Analytics is used for data-driven or fact-based decision making, which is often seen as more objective than alternative approaches to decision making. The tools of analytics can aid decision making by creating insights from data, improving our ability to more accurately forecast for planning, helping us quantify risk, and yielding better alternatives through analysis.
Analytics is now generally thought to comprise three broad categories of techniques. These categories are descriptive analytics, predictive analytics, and prescriptive analytics.

Definition 16.27 Descriptive analytics encompasses the set of analytical techniques that describe what has happened in the past.

Examples of these types of techniques are data queries, reports, descriptive statistics, data visualization, data dashboards, and basic what-if spreadsheet models.

Definition 16.28 Predictive analytics consists of analytical techniques that use models constructed from past data to predict the future or to assess the impact of one variable on another.

Linear regression, time series analysis, and forecasting models fall into the category of predictive analytics. Simulation, which is the use of probability and statistical computer models to better understand risk, also falls under the category of predictive analytics.

Prescriptive analytics differs greatly from descriptive or predictive analytics. What distinguishes prescriptive analytics is that prescriptive models yield a best course of action to take. That is, the output of a prescriptive model is a best decision.

Definition 16.29 Prescriptive analytics is the set of analytical techniques that yield a best course of action.

Optimization models, which generate solutions that maximize or minimize some objective subject to a set of constraints, fall into the category of prescriptive models.

16.13 Big Data and Data Mining

Definition 16.30 Larger and more complex data sets are now often referred to as big data.

Big data is commonly characterized by three V’s: volume refers to the amount of available data; velocity refers to the speed at which data are collected and processed; and variety refers to the different data types. The term data warehousing refers to the process of capturing, storing, and maintaining the data.

Definition 16.31 Data Mining deals with methods for developing useful decision-making information from large databases. It can be defined as the automated extraction of predictive information from (large) databases.

Data mining relies heavily on statistical methodology such as multiple regression, logistic regression, and correlation.

16.14 Exercises

  • Table: 16.2
  • Table: 16.3
  • Table: 16.4
    • Who appears to be the market share leader, and how are the market shares changing over time?
      • Caution: Trend Analysis should be done by linear regression with cor(), lm() etc.

R

Load Data

xxComputers <- f_getObject("xxComputers", "C01-Computers.csv", "971fb6096e4f71e8185d3327a9033a10")
xxCordless <- f_getObject("xxCordless", "C01-Cordless.csv", "9991f612fe44f1c890440bd238084679")

f_getObject()

f_getObject <- function(x_name, x_source, x_md = "") {
  # #Debugging
  a07bug <- FALSE
  # #Read the File or Object
  # #Ex: xxCars <- f_getObject("xxCars", "S16-cars2.csv", "30051fb47f65810f33cb992015b849cc")
  # #tools::md5sum("xx.csv") OR tools::md5sum(paste0(.z$XL, "xx", ".txt"))
  #
  # #Path to the File 
  loc_src <- paste0(.z$XL, x_source)
  # #Path to the Object
  loc_rds <- paste0(.z$XL, x_name, ".rds")
  #
  # #x_file[1] FILENAME & x_file[2] FILETYPE
  x_file <- strsplit(x_source, "[.]")[[1]]
  #
  if(all(x_md == tools::md5sum(loc_src),  file.exists(loc_rds),
        file.info(loc_src)$mtime < file.info(loc_rds)$mtime)) {
      # #Read RDS if (exists, newer than source, source not modified i.e. passes md5sum)
      if(a07bug) print("A07 Flag 01: Reading from RDS")
      return(readRDS(loc_rds))
  } else if(!file.exists(loc_src)){
      message("ERROR: File does not exist! : ", loc_src, "\n")
      stop()
  } else if(x_file[2] == "csv") {
      # #Read CSV as a Tibble
      # #col_double(), col_character(), col_logical(), col_integer()
      # #DATETIME (EXCEL) "YYYY-MM-DD HH:MM:SS" imported as "UTC"
      tbl <- read_csv(loc_src, show_col_types = FALSE)
      # #Remove Unnecessary Attributes
      attr(tbl, "spec") <- NULL
      attr(tbl, "problems") <- NULL
      # #Write Object as RDS
      saveRDS(tbl, loc_rds)
      # #Return Object
      if(a07bug) print("A07 Flag 02: Reading from Source and Saving as RDS")
      return(tbl)
  } else if(x_file[2] == "xlsx") {
      # #Read All Sheets of Excel in a list
      tbl <- lapply(excel_sheets(loc_src), read_excel, path = loc_src)
      # #Write Object as RDS
      saveRDS(tbl, loc_rds)
      # #Return Object
      return(tbl)
  } else {
      message("f_getObject(): UNKNOWN")
      stop()
  }
}

Transpose Tibble

bb <- tibble(Company = c("Hertz", "Dollar", "Avis"), 
              `2007` = c(327, 167, 204), `2008` = c(311, 140, 220),
              `2009` = c(286, 106, 300), `2010` = c(290, 108, 270))
# #Transpose Tibble: Note that the First Column Header is lost after Transpose
# #Longer
bb %>% pivot_longer(!Company, names_to = "Year", values_to = "Values")
## # A tibble: 12 x 3
##    Company Year  Values
##    <chr>   <chr>  <dbl>
##  1 Hertz   2007     327
##  2 Hertz   2008     311
##  3 Hertz   2009     286
##  4 Hertz   2010     290
##  5 Dollar  2007     167
##  6 Dollar  2008     140
##  7 Dollar  2009     106
##  8 Dollar  2010     108
##  9 Avis    2007     204
## 10 Avis    2008     220
## 11 Avis    2009     300
## 12 Avis    2010     270
# #Transpose
(ii <- bb %>% 
  pivot_longer(!Company, names_to = "Year", values_to = "Values") %>% 
  pivot_wider(names_from = Company, values_from = Values))
## # A tibble: 4 x 4
##   Year  Hertz Dollar  Avis
##   <chr> <dbl>  <dbl> <dbl>
## 1 2007    327    167   204
## 2 2008    311    140   220
## 3 2009    286    106   300
## 4 2010    290    108   270
# #Equivalent
stopifnot(identical(ii, 
                    bb %>% pivot_longer(-1) %>% 
                      pivot_wider(names_from = 1, values_from = value) %>% 
                      rename(., Year = name)))

Computers

Table 16.2: (C01T02) xxComputers
SN tablet cost os display_inch battery_hh cpu
1 Acer Iconia W510 599 Windows 10.1 8.5 Intel
2 Amazon Kindle Fire HD 299 Android 8.9 9.0 TI OMAP
3 Apple iPad 4 499 iOS 9.7 11.0 Apple
4 HP Envy X2 860 Windows 11.6 8.0 Intel
5 Lenovo ThinkPad Tablet 668 Windows 10.1 10.5 Intel
6 Microsoft Surface Pro 899 Windows 10.6 4.0 Intel
7 Motorola Droid XYboard 530 Android 10.1 9.0 TI OMAP
8 Samsung Ativ Smart PC 590 Windows 11.6 7.0 Intel
9 Samsung Galaxy Tab 525 Android 10.1 10.0 Nvidia
10 Sony Tablet S 360 Android 9.4 8.0 Nvidia

Mean

# #What is the average cost for the tablets #$582.90
bb <- xxComputers  # work on the Computers data loaded above
cat(paste0("Avg. Cost for the tablets is = $", round(mean(bb$cost), digits = 1), "\n"))
## Avg. Cost for the tablets is = $582.9
#
# #Compare the average cost of tablets with different OS (Windows /Android) #$723.20 $428.5
(ii <- bb %>%
  group_by(os) %>%
  summarise(Mean = round(mean(cost), digits =1)) %>%
  arrange(desc(Mean)) %>% 
    mutate(Mean = paste0("$", Mean)))
## # A tibble: 3 x 2
##   os      Mean  
##   <chr>   <chr> 
## 1 Windows $723.2
## 2 iOS     $499  
## 3 Android $428.5
#
cat(paste0("Avg. Cost of Tablets with Windows OS is = ", 
  ii %>% filter(os == "Windows") %>% select(Mean), "\n"))
## Avg. Cost of Tablets with Windows OS is = $723.2

Percentage

# #What percentage of tablets use an Android operating system #40%
(ii <- bb %>%
  group_by(os) %>%
  summarise(PCT = n()) %>%
  mutate(PCT = 100 * PCT / sum(PCT)) %>% 
  arrange(desc(PCT)) %>% 
  mutate(PCT = paste0(PCT, "%")))
## # A tibble: 3 x 2
##   os      PCT  
##   <chr>   <chr>
## 1 Windows 50%  
## 2 Android 40%  
## 3 iOS     10%
#
cat(paste0("Android OS is used in ", 
  ii %>% filter(os == "Android") %>% select(PCT), " Tablets\n"))
## Android OS is used in 40% Tablets

Cordless

Table 16.3: (C01T03) xxCordless
SN brand model price overall_score voice_quality handset_on_base talk_time_hh
1 AT&T CL84100 60 73 Excellent Yes 7
2 AT&T TL92271 80 70 Very Good No 7
3 Panasonic 4773B 100 78 Very Good Yes 13
4 Panasonic 6592T 70 72 Very Good No 13
5 Uniden D2997 45 70 Very Good No 10
6 Uniden D1788 80 73 Very Good Yes 7
7 Vtech DS6521 60 72 Excellent No 7
8 Vtech CS6649 50 72 Very Good Yes 7

Mean

# #What is the average price for the cordless telephones 
bb <- xxCordless  # work on the Cordless data loaded above
cat(paste0("Avg. Price is = $", round(mean(bb$price), digits = 1), "\n"))
## Avg. Price is = $68.1
#
# #What is the average talk time for the cordless telephones
cat(paste0("Avg. Talk Time is = ", round(mean(bb$talk_time_hh), digits = 1), " Hours \n"))
## Avg. Talk Time is = 8.9 Hours

Percentage

# #What percentage of the cordless telephones have a voice quality of excellent 
(hh <- bb %>%
  group_by(voice_quality) %>%
  summarise(PCT = n()) %>%
  mutate(PCT = 100 * PCT / sum(PCT)) %>% 
    mutate(voice_quality = factor(voice_quality, 
                                  levels = c("Very Good", "Excellent"), ordered = TRUE)) %>% 
    arrange(desc(voice_quality)) %>% 
    mutate(PCT = paste0(PCT, "%")))
## # A tibble: 2 x 2
##   voice_quality PCT  
##   <ord>         <chr>
## 1 Excellent     25%  
## 2 Very Good     75%
#
cat(paste0("Percentage of 'Excellent' Voice Quality is = ", 
  hh %>% filter(voice_quality == "Excellent") %>% select(PCT), "\n"))
## Percentage of 'Excellent' Voice Quality is = 25%
#
# #Equivalent
print(bb %>%
 group_by(voice_quality) %>%
 summarise(PCT = n()) %>%
 mutate(PCT = prop.table(PCT) * 100))
## # A tibble: 2 x 2
##   voice_quality   PCT
##   <chr>         <dbl>
## 1 Excellent        25
## 2 Very Good        75

PCT 2

# #What percentage of the cordless telephones have a handset on the base 
bb %>%
  group_by(handset_on_base) %>%
  summarise(PCT = n()) %>%
  mutate(PCT = 100 * PCT / sum(PCT)) %>% 
  arrange(desc(PCT)) %>% 
  mutate(PCT = paste0(PCT, "%")) %>%
  filter(handset_on_base == "Yes") 
## # A tibble: 1 x 2
##   handset_on_base PCT  
##   <chr>           <chr>
## 1 Yes             50%

Cars

Transform

Table 16.4: (C01T04) Cars in Service
Company 2007 2008 2009 2010
Hertz 327 311 286 290
Dollar 167 140 106 108
Avis 204 220 300 270
Table 16.4: (C01T04B) Cars (Transposed)
Year Hertz Dollar Avis
2007 327 167 204
2008 311 140 220
2009 286 106 300
2010 290 108 270
bb <- tibble(Company = c("Hertz", "Dollar", "Avis"), 
              `2007` = c(327, 167, 204), `2008` = c(311, 140, 220),
              `2009` = c(286, 106, 300), `2010` = c(290, 108, 270))
# #Transpose Tibble: Note that the First Column Header is lost after Transpose
# #Longer
hh <- bb %>% pivot_longer(!Company, names_to = "Year", values_to = "Values")
# #Transpose
ii <- bb %>% 
  pivot_longer(!Company, names_to = "Year", values_to = "Values") %>% 
  pivot_wider(names_from = Company, values_from = Values)

TimeSeries

# #Save an Image
loc_png <- paste0(.z$PX, "C01P01", "-Cars-TimeSeries", ".png")
ggsave(loc_png, plot = C01P01, device = "png", dpi = 144) 
# #Load an Image
knitr::include_graphics(paste0(.z$PX, "C01P01", "-Cars-TimeSeries", ".png"))
Multiple Time Series Graph

Figure 16.1 Multiple Time Series Graph

Rowwise

# #who appears to be the market share leader
# #how the market shares are changing over time
print(ii)
## # A tibble: 4 x 4
##   Year  Hertz Dollar  Avis
##   <chr> <dbl>  <dbl> <dbl>
## 1 2007    327    167   204
## 2 2008    311    140   220
## 3 2009    286    106   300
## 4 2010    290    108   270
# #Row Total
jj <- ii %>% rowwise() %>% mutate(SUM = sum(c_across(where(is.numeric)))) %>% ungroup()
kk <- ii %>% mutate(SUM = rowSums(across(where(is.numeric))))
stopifnot(identical(jj, kk))
#
# #Rowwise Percentage Share 
ii %>% 
  rowwise() %>% 
  mutate(SUM = sum(c_across(where(is.numeric)))) %>% 
  ungroup() %>%
  mutate(across(2:4, ~ round(. * 100 / SUM, digits = 1), .names = "{.col}.{.fn}")) %>%
  mutate(across(ends_with(".1"), ~ paste0(., "%")))
## # A tibble: 4 x 8
##   Year  Hertz Dollar  Avis   SUM Hertz.1 Dollar.1 Avis.1
##   <chr> <dbl>  <dbl> <dbl> <dbl> <chr>   <chr>    <chr> 
## 1 2007    327    167   204   698 46.8%   23.9%    29.2% 
## 2 2008    311    140   220   671 46.3%   20.9%    32.8% 
## 3 2009    286    106   300   692 41.3%   15.3%    43.4% 
## 4 2010    290    108   270   668 43.4%   16.2%    40.4%

Pareto

# #Bar Plot
aa <- bb %>% 
  select(Company, `2010`) %>% 
  rename("Y2010" = `2010`) %>% 
  arrange(desc(.[2])) %>% 
  mutate(cSUM = cumsum(Y2010)) %>%
  mutate(PCT = 100 * Y2010 / sum(Y2010)) %>% 
  mutate(cPCT = 100 * cumsum(Y2010) / sum(Y2010)) %>% 
  mutate(across(Company, factor, levels = unique(Company), ordered = TRUE))
# #
pareto_chr <- setNames(c(aa$Y2010), aa$Company)
stopifnot(identical(pareto_chr, aa %>% pull(Y2010, Company)))
stopifnot(identical(pareto_chr, aa %>% select(1:2) %>% deframe()))
# #Plot Pareto
C01P02 <- pareto.chart(pareto_chr, 
             xlab = "Company", ylab = "Cars",
             # colors of the chart             
             #col=heat.colors(length(pareto_chr)), 
             # ranges of the percentages at the right
             cumperc = seq(0, 100, by = 20),  
             # label y right
             ylab2 = "Cumulative Percentage", 
             # title of the chart
             main = "Pareto Chart"
)
Pareto of Cars in 2010

Figure 16.2 Pareto of Cars in 2010

Validation


17 Descriptive Statistics

17.1 Overview

17.2 Summarizing Data for a Categorical Variable

Definition 17.1 A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several non-overlapping categories or classes.

The relative frequency of a class equals the fraction or proportion of observations belonging to a class i.e. it is out of 1 whereas ‘percent frequency’ is out of 100%.

Rather than showing the frequency of each class, the cumulative frequency distribution shows the number of data items with values less than or equal to the upper class limit of each class.
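
These three quantities can be sketched in base R (toy data assumed); the dplyr pipeline shown later in this chapter produces the same kind of summary:

```r
x    <- c("A", "B", "A", "C", "A", "B")          # assumed categorical observations
freq <- sort(table(x), decreasing = TRUE)        # frequency distribution
relf <- prop.table(freq)                         # relative frequency (out of 1)
pctf <- 100 * relf                               # percent frequency (out of 100)
cumf <- cumsum(freq)                             # cumulative frequency
rbind(freq, relf, pctf, cumf)
```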

  • Bar Chart
    • Pareto Chart - ggplot() does not allow easy setup of dual axis
    • Stacked Bar Chart - do not use it if there are more than 2 categories
  • Pie Chart
    • Only use it if total is 100% and Categories are fewer than 5 or 6.

Bar & Pie

Table 17.1: (C02T02) Frequency Distribution
softdrink Frequency cSUM PROP PCT cPCT
Coca-Cola 19 19 38 38% 38%
Pepsi 13 32 26 26% 64%
Diet Coke 8 40 16 16% 80%
Dr. Pepper 5 45 10 10% 90%
Sprite 5 50 10 10% 100%
Bar Chart and Pie Chart of Frequency

Figure 17.1 Bar Chart and Pie Chart of Frequency

Data

# #Frequency Distribution
aa <- tibble(softdrink = c("Coca-Cola", "Diet Coke", "Dr. Pepper", "Pepsi", "Sprite"), 
             Frequency = c(19, 8, 5, 13, 5))
#
# #Sort, Cumulative Sum, Percentage, and Cumulative Percentage
bb <- aa %>% 
  arrange(desc(Frequency)) %>% 
  mutate(cSUM = cumsum(Frequency)) %>%
  mutate(PROP = 100 * Frequency / sum(Frequency)) %>% 
  mutate(PCT = paste0(PROP, "%")) %>% 
  mutate(cPCT = paste0(100 * cumsum(Frequency) / sum(Frequency), "%"))

Bar

# #Sorted Bar Chart of Frequencies (Needs x-axis as Factor for proper sorting)
C02P01 <- bb %>% mutate(across(softdrink, factor, levels = rev(unique(softdrink)))) %>% {
  ggplot(data = ., aes(x = softdrink, y = Frequency)) +
  geom_bar(stat = 'identity', aes(fill = softdrink)) + 
  scale_y_continuous(sec.axis = sec_axis(~ (. / sum(bb$Frequency))*100, name = "Percentages", 
                       labels = function(b) { paste0(round(b, 0), "%")})) +
  geom_text(aes(label = paste0(Frequency, "\n(", PCT, ")")), vjust = 2, 
            colour = c(rep("black", 2), rep("white", nrow(bb)-2))) +
  k_gglayer_bar +   
  labs(x = "Soft Drinks", y = "Frequency", subtitle = NULL, 
         caption = "C02P01", title = "Bar Chart of Categorical Data")
}

Pie

# #Pie Chart of Frequencies (Needs x-axis as Factor for proper sorting)
C02P02 <- bb %>% mutate(across(softdrink, factor, levels = unique(softdrink))) %>% {
  ggplot(data = ., aes(x = '', y = Frequency, fill = rev(softdrink))) +
  geom_bar(stat = 'identity', width = 1, color = "white") +
  coord_polar(theta = "y", start = 0) +
  geom_text(aes(label = paste0(softdrink, "\n", Frequency, " (", PCT, ")")), 
            position = position_stack(vjust = 0.5), 
            colour = c(rep("black", 2), rep("white", nrow(bb)-2))) +
  k_gglayer_pie +   
  labs(caption = "C02P02", title = "Pie Chart of Categorical Data")
}

f_theme_gg()

f_theme_gg <- function(base_size = 14) {
# #Create a Default Theme 
  theme_bw(base_size = base_size) %+replace%
    theme(
      # #The whole figure
      plot.title = element_text(size = rel(1), face = "bold", 
                                margin = margin(0,0,5,0), hjust = 0),
      # #Area where the graph is located
      panel.grid.minor = element_blank(),
      panel.border = element_blank(),
      # #The axes
      axis.title = element_text(size = rel(0.85), face = "bold"),
      axis.text = element_text(size = rel(0.70), face = "bold"),
#      arrow = arrow(length = unit(0.3, "lines"), type = "closed"),
      axis.line = element_line(color = "black"),
      # The legend
      legend.title = element_text(size = rel(0.85), face = "bold"),
      legend.text = element_text(size = rel(0.70), face = "bold"),
      legend.key = element_rect(fill = "transparent", colour = NA),
      legend.key.size = unit(1.5, "lines"),
      legend.background = element_rect(fill = "transparent", colour = NA),
      # Labels in the case of facetting
      strip.background = element_rect(fill = "#17252D", color = "#17252D"),
      strip.text = element_text(size = rel(0.85), face = "bold", color = "white", margin = margin(5,0,5,0))
    )
}
# #Change default ggplot2 theme 
theme_set(f_theme_gg()) 
#
# #List of Specific sets. Note '+' is replaced by ','
k_gglayer_bar <- list(
  scale_fill_viridis_d(),
  theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
        panel.border = element_rect(colour = "black", fill=NA, size=1),
        legend.position = 'none', axis.title.y.right = element_blank())
)
#
# #Pie
k_gglayer_pie <- list(
  scale_fill_viridis_d(),
  #theme_void(),
  theme(#panel.background = element_rect(fill = "white", colour = "white"),
        #plot.background = element_rect(fill = "white",colour = "white"),
        axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        #panel.border = element_rect(colour = "black", fill=NA, size=1),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = 'none')
)
#
# #Histogram
k_gglayer_hist <- list(
  scale_fill_viridis_c(direction = -1, alpha = 0.9),
  theme(panel.grid.major.x = element_blank(), axis.line.y = element_blank(),
        panel.border = element_blank(), axis.ticks.y = element_blank(), 
        legend.position = 'none')
)
#
# #Scatter Plot Trendline
k_gglayer_scatter <- list(
  scale_fill_viridis_d(alpha = 0.9),
  theme(panel.grid.minor = element_blank(),
        panel.border = element_blank())
)
#
# #BoxPlot
k_gglayer_box <- list(
  scale_fill_viridis_d(alpha = 0.9),
  theme(panel.grid.major = element_line(colour = "#d3d3d3"),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(), panel.grid.major.x = element_blank(),
        #plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
        #text=element_text(family = "Tahoma"),
        #axis.title = element_text(face="bold"),
        #axis.text.x = element_text(colour="black", size = 11),
        #axis.text.y = element_text(colour="black", size = 9),
        axis.line = element_line(size=0.5, colour = "black"))
)
#

Errors

ERROR 17.1 Error: stat_count() can only have an x or y aesthetic.

Solution: Use geom_bar(stat = "identity")

17.3 Summarizing Data for a Quantitative Variable

A histogram is used for continuous data, where the bins represent ranges of data, while a bar chart is a plot of categorical variables.

The three steps necessary to define the classes for a frequency distribution with quantitative data are

  1. Determine the number of nonoverlapping classes (Bins)
    • Classes are formed by specifying ranges that will be used to group the data.
    • Approx. 5-20
  2. Determine the width of each class
    • The bins are usually specified as consecutive, non-overlapping intervals of a variable.
    • The bins (intervals) must be adjacent and are often (but not required to be) of equal size.
    • Approx. Bin Width = (Max - Min) / Number of Bins
    • Ex: For a dataset with min = 12 & max = 33, 5 bins of 10-14, …, 30-34 can be selected
  3. Determine the class
    • Class limits must be chosen so that each data item belongs to one and only one class
    • For categorical data, this was not required because each item naturally fell into a separate class
    • But with quantitative data, class limits are necessary to determine where each data value belongs
    • The ‘class midpoint’ is the value halfway between the lower and upper class limits. For a Bin of 10-14, 12 will be its mid-point.
  • Dot Plot
    • A horizontal axis shows the range for the data. Each data value is represented by a dot placed above the axis.
    • Caution: Avoid! Y-Axis is deceptive.
  • Histogram
    • Unlike a bar graph, a histogram contains no natural separation between the rectangles of adjacent classes.
  • Stem-and-Leaf Display (Not useful)
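
The three steps above, applied to the min = 12, max = 33 example (the data vector itself is assumed for illustration):

```r
x <- c(12, 33, 14, 29, 21, 18, 25, 30, 16, 22)   # assumed data with min = 12, max = 33
k <- 5                                            # Step 1: number of bins
width <- ceiling((max(x) - min(x)) / k)           # Step 2: (33 - 12) / 5 = 4.2, round up to 5
breaks <- seq(10, 35, by = width)                 # Step 3: class limits 10-14, 15-19, ..., 30-34
table(cut(x, breaks = breaks, right = FALSE))     # each value falls in exactly one class
head(breaks, -1) + (width - 1) / 2                # class midpoints: 12, 17, 22, 27, 32
```

cut() with right = FALSE makes each class include its lower limit only, so no value can land in two classes.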

Histogram

set.seed(3)
# #Get Normal Data
bb <- tibble(aa = rnorm(n = 10000)) 
# #Histogram
# # '..count..' or '..x..'
C02P03 <- bb %>% {
  ggplot(data = ., aes(x = aa, fill = ..count..)) + 
  geom_histogram(bins = 50, position = "identity") +    
  k_gglayer_hist +
  labs(x = "Normal Data", y = "Count", subtitle = paste0("n = ", format(nrow(.), big.mark = ",")), 
       caption = "C02P03", title = "Histogram")
}
geom_histogram(): Histogram

Figure 17.2 geom_histogram(): Histogram

Dot Plot

# #Random Data
aa <- c(26, 35, 22, 47, 37, 5, 50, 49, 42, 2, 8, 7, 4, 47, 44, 35, 17, 49, 1, 48, 
        1, 27, 13, 26, 18, 44, 31, 4, 23, 47, 38, 28, 28, 5, 35, 39, 29, 13, 17, 
        38, 1, 8, 3, 30, 18, 37, 29, 39, 7, 28)
bb <- tibble(aa)
# #Dot Chart of Frequencies
C02P04 <- bb %>% {
  ggplot(., aes(x = aa)) +
  geom_dotplot(binwidth = 5, method = "histodot") + 
  theme(axis.line.y = element_blank(), panel.grid = element_blank(), axis.text.y = element_blank(),
        axis.ticks.y = element_blank(), axis.title.y =  element_blank()) + 
  labs(x = "Bins", subtitle = "Caution: Avoid! Y-Axis is deceptive.", 
       caption = "C02P04", title = "Dot Plot")
}
geom_dotplot(): Frequency Dot Chart

Figure 17.3 geom_dotplot(): Frequency Dot Chart

Get Frequency

17.4 Summarizing Data for Two Variables Using Tables

Definition 17.2 A crosstabulation is a tabular summary of data for two variables. It is used to investigate the relationship between them. Generally, one of the variables is categorical.
  • Simpson Paradox
    • The reversal of conclusions based on aggregated and unaggregated data is called Simpson paradox.
    • Ex: Table 17.2 shows the counts of judgements that were ‘upheld’ or ‘reversed’ on appeal for two judges.
      • 86% of the verdicts were upheld for Judge Abel, while 88% were upheld for Judge Ken. From this aggregated crosstabulation, we would conclude that Judge Ken is doing the better job because a greater percentage of his verdicts are upheld.
      • However, the unaggregated crosstabulations show that in both types of courts (Common, Municipal) Judge Abel has a higher percentage of ‘Upheld’ verdicts (90.6% and 84.7%) compared to Judge Ken (90% and 80%).
      • Thus, Abel has the better record, because a greater percentage of his verdicts are upheld in both courts.
      • This reversal of conclusions based on aggregated and unaggregated data illustrates Simpson’s paradox.
    • Cause
      • Note that for both judges the percentage of appeals that resulted in reversals was much higher in the ‘Municipal’ Court than in the ‘Common’ Court, i.e. 15.3% vs. 9.4% for Abel and 20% vs. 10% for Ken.
      • Because Judge Abel tried a much higher share of his cases in the ‘Municipal’ Court (118 of 150, vs. 25 of 125 for Ken), the aggregated data favoured Judge Ken.
      • Thus, for the original crosstabulation, we see that the ‘type of court’ is a hidden variable that cannot be ignored when evaluating the records of the two judges.
Table 17.2: (C02T01) Both Judges
Judge_Verdict xUpheld xReversed SUM
Abel 129 (86%) 21 (14%) 150
Ken 110 (88%) 15 (12%) 125
Total 239 (86.9%) 36 (13.1%) 275
Table 17.2: (C02T01A) Abel
Abel xUpheld xReversed SUM
Common 29 (90.6%) 3 (9.4%) 32
Municipal 100 (84.7%) 18 (15.3%) 118
Total 129 (86%) 21 (14%) 150
Table 17.2: (C02T01B) Ken
Ken xUpheld xReversed SUM
Common 90 (90%) 10 (10%) 100
Municipal 20 (80%) 5 (20%) 25
Total 110 (88%) 15 (12%) 125

Judges

# #Judges: Because we are evaluating 'Judges', they are the 'elements' and thus are the 'rows'
xxJudges <- tibble(Judge_Verdict = c('Abel', 'Ken'), Upheld = c(129, 110), Reversed = c(21, 15))
# #Unaggregated crosstabs for both Judges in the two types of Courts
xxKen <- tibble(Ken = c("Common", "Municipal"), 
                    Upheld = c(90, 20), Reversed = c(10, 5))
xxAbel <- tibble(Abel = c("Common", "Municipal"), 
                    Upheld = c(29, 100), Reversed = c(3, 18))

Transpose

# #Judges
aa <- tibble(Judge_Verdict = c('Abel', 'Ken'), Upheld = c(129, 110), Reversed = c(21, 15))
bb <- tibble(Verdict_Judge = c('Upheld', 'Reversed'), Abel = c(129, 21), Ken = c(110, 15))
aa
## # A tibble: 2 x 3
##   Judge_Verdict Upheld Reversed
##   <chr>          <dbl>    <dbl>
## 1 Abel             129       21
## 2 Ken              110       15
# #Transpose, Assuming First Column Header has "Row_Col" Type Format
ii <- aa %>% 
  `attr<-`("ColsLost", unlist(strsplit(names(.)[1], "_"))[1]) %>% 
  `attr<-`("RowsKept", unlist(strsplit(names(.)[1], "_"))[2]) %>% 
  pivot_longer(c(-1), 
               names_to = paste0(attributes(.)$RowsKept, "_", attributes(.)$ColsLost), 
               values_to = "Values") %>% 
  pivot_wider(names_from = 1, values_from = Values) %>% 
  `attr<-`("ColsLost", NULL) %>% `attr<-`("RowsKept", NULL) 
stopifnot(identical(bb, ii))
ii
## # A tibble: 2 x 3
##   Verdict_Judge  Abel   Ken
##   <chr>         <dbl> <dbl>
## 1 Upheld          129   110
## 2 Reversed         21    15
# #Testing for Reverse
ii <- bb %>% 
  `attr<-`("ColsLost", unlist(strsplit(names(.)[1], "_"))[1]) %>% 
  `attr<-`("RowsKept", unlist(strsplit(names(.)[1], "_"))[2]) %>% 
  pivot_longer(c(-1), 
               names_to = paste0(attributes(.)$RowsKept, "_", attributes(.)$ColsLost), 
               values_to = "Values") %>% 
  pivot_wider(names_from = 1, values_from = Values) %>% 
  `attr<-`("ColsLost", NULL) %>% `attr<-`("RowsKept", NULL) 
stopifnot(identical(aa, ii))

String Split

bb <- "Judge_Verdict"
# #Split String by strsplit(), output is list
(ii <- unlist(strsplit(bb, "_")))
## [1] "Judge"   "Verdict"
#
# #Split on Dot 
bb <- "Judge.Verdict"
# #Using character classes
ii <- unlist(strsplit(bb, "[.]"))
# #By escaping special characters
jj <- unlist(strsplit(bb, "\\."))
# #Using Options
kk <- unlist(strsplit(bb, ".", fixed = TRUE))
stopifnot(all(identical(ii, jj), identical(ii, kk)))

Attributes

  • Tibble
    • ‘problems’ attribute contains List of All Problems
      • problems(bb)
    • ‘spec’ attribute contains List of Columns and Types
      • spec(bb)
      • Caution: These show a snapshot at import, NOT the current status; better to remove them
jj <- ii <- bb <- aa
# #attr() adds or removes an attribute
attr(bb, "NewOne") <- "abc"
# #Using Backticks
ii <- `attr<-`(ii, "NewOne", "abc")
# #Using Pipe
jj <- jj %>% `attr<-`("NewOne", "abc")
#
stopifnot(all(identical(bb, ii), identical(bb, jj)))
#
# #List Attributes
names(attributes(bb))
## [1] "class"     "row.names" "names"     "NewOne"
#
# #Specific Attribute Value
attributes(bb)$NewOne
## [1] "abc"
#
# #Remove Attributes
attr(bb, "NewOne") <- NULL
ii <- `attr<-`(ii, "NewOne", NULL)
jj <- jj %>% `attr<-`("NewOne", NULL)
stopifnot(all(identical(bb, ii), identical(bb, jj)))

Total Row

# #(Deprecated) Issues: 
# #(1) bind_rows() needs two dataframes. Thus, first can be skipped in Pipe, But...
# #The second dataframe cannot be replaced with dot (.), it has to have a name
# #(2) Pipe usage inside function call was working but was a concern
# #(3) It introduced NA for that replace was needed as another step
ii <- aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum))) %>%
    mutate(across(1, ~ replace(., . %in% NA, "Total"))) 
#
# #(Deprecated) Works but needs ALL Column Names individually
jj <- aa %>% add_row(Judge_Verdict = "Total", Upheld = sum(.[, 2]), Reversed = sum(.[, 3]))
kk <- aa %>% add_row(Judge_Verdict = "Total", Upheld = sum(.$Upheld), Reversed = sum(.$Reversed))
#
# #(Deprecated) Removed the Multiple call to sum(). However, it needs First Column Header Name
ll <- aa %>% add_row(Judge_Verdict = "Total", summarise(., across(where(is.numeric), sum)))
# #(Deprecated) Replaced Column Header Name using "Tilde"
mm <- aa %>% add_row(summarise(., across(where(is.character), ~"Total")), 
               summarise(., across(where(is.numeric), sum)))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm)))
#
# #(Working): Minimised
aa %>% add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.numeric), sum)))
## # A tibble: 3 x 3
##   Judge_Verdict Upheld Reversed
##   <chr>          <dbl>    <dbl>
## 1 Abel             129       21
## 2 Ken              110       15
## 3 Total            239       36

Replace NA

# # USE '%in%' for NA, otherwise '==' works
bb <- aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum)))
#
ii <- bb %>% mutate(across(1, ~ replace(., . %in% NA, "Total"))) 
mm <- bb %>% mutate(across(1, ~ replace(., is.na(.), "Total"))) 
jj <- bb %>% mutate(Judge_Verdict = replace(Judge_Verdict, is.na(Judge_Verdict), "Total"))
kk <- bb %>% mutate(across(1, coalesce, "Total")) 
ll <- bb %>% mutate(across(1, ~ replace_na(.x, "Total")))
nn <- bb %>% mutate(across(1, replace_na, "Total"))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), 
              identical(ii, mm), identical(ii, nn)))
#
#   #Replace NA in a Factor
bb %>% 
  mutate(Judge_Verdict = factor(Judge_Verdict)) %>% 
  mutate(across(1, fct_explicit_na, na_level = "Total"))
## # A tibble: 3 x 3
##   Judge_Verdict Upheld Reversed
##   <fct>          <dbl>    <dbl>
## 1 Abel             129       21
## 2 Ken              110       15
## 3 Total            239       36

To Factor

#   #Convert to Factor
aa %>% mutate(Judge_Verdict = factor(Judge_Verdict))
## # A tibble: 2 x 3
##   Judge_Verdict Upheld Reversed
##   <fct>          <dbl>    <dbl>
## 1 Abel             129       21
## 2 Ken              110       15

Clipboard

# #Paste but do not execute
aa <- read_delim(clipboard())
# #Copy Excel Data, then execute the above command
#
# #Print its structure
dput(aa)
# #Copy the relevant values, headers in tibble()
bb <- tibble(  )
# #The above command will be the setup to generate this tibble anywhere

17.5 Exercise

C02E27

Data

ex27 <- tibble(Observation = 1:30, 
             x = c("A", "B", "B", "C", "B", "C", "B", "C", "A", "B", "A", "B", "C", "C", "C", 
                   "B", "C", "B", "C", "B", "C", "B", "C", "A", "B", "C", "C", "A", "B", "B"), 
             y = c(1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 
                   2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2))

CrossTab

bb <- ex27
str(bb)
## tibble [30 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Observation: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
##  $ x          : chr [1:30] "A" "B" "B" "C" ...
##  $ y          : num [1:30] 1 1 1 2 1 2 1 2 1 1 ...
# #Create CrossTab
bb <- bb %>% 
  count(x, y) %>% 
  pivot_wider(names_from = y, values_from = n, values_fill = 0)

PCT

bb
## # A tibble: 3 x 3
##   x       `1`   `2`
##   <chr> <int> <int>
## 1 A         5     0
## 2 B        11     2
## 3 C         2    10
# #Rowwise Percentage in Separate New Columns
bb %>% 
  mutate(SUM = rowSums(across(where(is.numeric)))) %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1), .names = "{.col}_Row" )) 
## # A tibble: 3 x 7
##   x       `1`   `2`   SUM `1_Row` `2_Row` SUM_Row
##   <chr> <int> <int> <dbl>   <dbl>   <dbl>   <dbl>
## 1 A         5     0     5   100       0       100
## 2 B        11     2    13    84.6    15.4     100
## 3 C         2    10    12    16.7    83.3     100
#
# #Rowwise Percentage in Same Columns
bb %>% 
  mutate(SUM = rowSums(across(where(is.numeric)))) %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1))) 
## # A tibble: 3 x 4
##   x       `1`   `2`   SUM
##   <chr> <dbl> <dbl> <dbl>
## 1 A     100     0     100
## 2 B      84.6  15.4   100
## 3 C      16.7  83.3   100
#
# #Equivalent
bb %>% 
  mutate(SUM = rowSums(across(where(is.numeric))),
         across(where(is.numeric), ~ round(. * 100 /SUM, 1))) 
## # A tibble: 3 x 4
##   x       `1`   `2`   SUM
##   <chr> <dbl> <dbl> <dbl>
## 1 A     100     0     100
## 2 B      84.6  15.4   100
## 3 C      16.7  83.3   100
#
# #Columnwise Percentage in Separate New Columns
bb %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /sum(.), 1), .names = "{.col}_Col" ))
## # A tibble: 3 x 5
##   x       `1`   `2` `1_Col` `2_Col`
##   <chr> <int> <int>   <dbl>   <dbl>
## 1 A         5     0    27.8     0  
## 2 B        11     2    61.1    16.7
## 3 C         2    10    11.1    83.3
# #Columnwise Percentage in Same Columns
bb %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /sum(.), 1)))
## # A tibble: 3 x 3
##   x       `1`   `2`
##   <chr> <dbl> <dbl>
## 1 A      27.8   0  
## 2 B      61.1  16.7
## 3 C      11.1  83.3

C02E28

Data

ex28 <- tibble(Observation = 1:20, 
        x = c(28, 17, 52, 79, 37, 71, 37, 27, 64, 53, 13, 84, 59, 17, 70, 47, 35, 62, 30, 43), 
        y = c(72, 99, 58, 34, 60, 22, 77, 85, 45, 47, 98, 21, 32, 81, 34, 64, 68, 67, 39, 28))

CrossTab

bb <- ex28
# #Rounding to the lowest 10s before min and to the highest 10s after max
nn <- 10L   
n_x <- seq(floor(min(bb$x) / nn) * nn, ceiling(max(bb$x) / nn) * nn, by = 20)
n_y <- seq(floor(min(bb$y) / nn) * nn, ceiling(max(bb$y) / nn) * nn, by = 20)
#
# #Labels in the format of [10-29]
lab_x <- paste0(n_x, "-", n_x + 20 - 1) %>% head(-1)
lab_y <- paste0(n_y, "-", n_y + 20 - 1) %>% head(-1)

# #Wider Table without Totals
ii <- bb %>% 
  mutate(x_bins = cut(x, breaks = n_x, right = FALSE, labels = lab_x),
         y_bins = cut(y, breaks = n_y, right = FALSE, labels = lab_y)) %>% 
  count(x_bins, y_bins) %>% 
  pivot_wider(names_from = y_bins, values_from = n, values_fill = 0, names_sort = TRUE)
print(ii)
## # A tibble: 4 x 5
##   x_bins `20-39` `40-59` `60-79` `80-99`
##   <fct>    <int>   <int>   <int>   <int>
## 1 10-29        0       0       1       4
## 2 30-49        2       0       4       0
## 3 50-69        1       3       1       0
## 4 70-89        4       0       0       0
# #Cross Tab with Total Column and Total Row
jj <- ii %>% 
  bind_rows(ii %>% summarise(across(where(is.numeric), sum))) %>% 
    mutate(across(1, fct_explicit_na, na_level = "Total")) %>% 
    mutate(SUM = rowSums(across(where(is.numeric))))
print(jj)
## # A tibble: 5 x 6
##   x_bins `20-39` `40-59` `60-79` `80-99`   SUM
##   <fct>    <int>   <int>   <int>   <int> <dbl>
## 1 10-29        0       0       1       4     5
## 2 30-49        2       0       4       0     6
## 3 50-69        1       3       1       0     5
## 4 70-89        4       0       0       0     4
## 5 Total        7       3       6       4    20

cut()

  • cut()
    • It slightly increases the range
    • ggplot2::cut_interval(), cut_width() do not increase the range
    • dig.lab : Options to exclude scientific notation
    • ordered_result : Option for ordered factor
# #Group Continuous Data to Categorical Bins by base::cut()
bb <- ex28
#
# #NOTE cut() increases the range slightly but ggplot functions do not
bb %>% mutate(x_bins = cut(x, breaks = 8)) %>% 
  pull(x_bins) %>% levels()
## [1] "(12.9,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84.1]"
# 
# #By default, it excludes the lower range, but it can be included by option
bb %>% mutate(x_bins = cut(x, breaks = 8, include.lowest = TRUE)) %>% 
  pull(x_bins) %>% levels()
## [1] "[12.9,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84.1]"
#
# #ggplot::cut_interval() makes n groups with equal range. There is a cut_number() also
bb %>% mutate(x_bins = cut_interval(x, n = 8)) %>% 
  pull(x_bins) %>% levels()
## [1] "[13,21.9]"   "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84]"
#
# #Specific Bins
bb %>% mutate(x_bins = cut(x, breaks = seq(10, 90, by = 10))) %>% 
  pull(x_bins) %>% levels()
## [1] "(10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
ii <- bb %>% mutate(x_bins = cut(x, breaks = seq(10, 90, by = 10), include.lowest = TRUE)) %>% 
  pull(x_bins) %>% levels()
print(ii)
## [1] "[10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
#
# #ggplot::cut_width() makes groups of width
bb %>% mutate(x_bins = cut_width(x, width = 10)) %>% 
  pull(x_bins) %>% levels()
## [1] "[5,15]"  "(15,25]" "(25,35]" "(35,45]" "(45,55]" "(55,65]" "(65,75]" "(75,85]"
#
# #Match cut_width() and cut()
jj <- bb %>% mutate(x_bins = cut_width(x, width = 10, boundary = 0)) %>% 
  pull(x_bins) %>% levels()
print(jj)
## [1] "[10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
stopifnot(identical(ii, jj))
#
# #Labelling
n_breaks <- seq(10, 90, by = 10)
n_labs <- paste0("*", n_breaks, "-", n_breaks + 10) %>% head(-1)

bb %>% mutate(x_bins = cut(x, breaks = n_breaks, include.lowest = TRUE, labels = n_labs)) %>% 
  pull(x_bins) %>% levels()
## [1] "*10-20" "*20-30" "*30-40" "*40-50" "*50-60" "*60-70" "*70-80" "*80-90"

17.6 Summarizing Data for Two Variables

  • Scatterplot and Trendline
  • Side by Side and Stacked Bar Charts

Data

xxCommercials <- tibble(Week = 1:10, 
                 Commercials = c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2), 
                 Sales = c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46))
f_setRDS(xxCommercials)

Figure 17.4 geom_point(), geom_smooth(), & stat_poly_eq()

Trendline

bb <- xxCommercials 

# #Define the formula for Trendline calculation
k_gg_formula <- y ~ x
#
# #Scatterplot, Trendline alongwith its equation and R2 value
C02P05 <- bb %>% {
  ggplot(data = ., aes(x = Commercials, y = Sales)) + 
  geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
  stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")), 
               formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
               eq.x.rhs = "~italic(x)", parse = TRUE) +
  geom_point() + 
  labs(x = "Commercials", y = "Sales ($100s)", 
       subtitle = paste0("Trendline equation and R", '\u00b2', " value"), 
       caption = "C02P05", title = "Scatter Plot")
}

Validation


18 Numerical Measures

18.1 Overview

18.2 Definitions (Ref)

Definition 16.22 The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.

18.3 Number Theory

Definition 18.1 A number is a mathematical object used to count, measure, and label. Their study or usage is called arithmetic, a term which may also refer to number theory, the study of the properties of numbers.

Individual numbers can be represented by symbols, called numerals; for example, “5” is a numeral that represents the ‘number five.’

As only a relatively small number of symbols can be memorized, basic numerals are commonly organized in a numeral system, which is an organized way to represent any number. The most common numeral system is the Hindu–Arabic numeral system, which allows for the representation of any number using a combination of ten fundamental numeric symbols, called digits.

Counting is the process of determining the number of elements of a finite set of objects, i.e., determining the size of a set. Enumeration refers to uniquely identifying the elements of a set by assigning a number to each element.

Measurement is the quantification of attributes of an object or event, which can be used to compare with other objects or events.

Sets

Formally, \(\mathbb{N} \to \mathbb{Z} \to \mathbb{Q} \to \mathbb{R} \to \mathbb{C}\)
Practically, \(\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R} \subset \mathbb{C}\)

The natural numbers \(\mathbb{N}\) are those numbers used for counting and ordering. The ISO standard begins the natural numbers with 0, corresponding to the non-negative integers \(\mathbb{N} = \{0, 1, 2, 3, \ldots \}\), whereas other conventions start with 1, corresponding to the positive integers \(\mathbb{N^*} = \{1, 2, 3, \ldots \}\)

The set of integers \(\mathbb{Z}\) consists of zero (\({0}\)), the positive natural numbers \(\{1, 2, 3, \ldots \}\) and their additive inverses (the negative integers). Thus, \(\mathbb{Z} = \{\ldots, -3, -2, -1, 0, 1, 2, 3, \ldots \}\). An integer is colloquially defined as a number that can be written without a fractional component.

Rational numbers \(\mathbb{Q}\) are those which can be expressed as the quotient or fraction p/q of two integers, a numerator p and a non-zero denominator q. Thus, Rational Numbers \(\mathbb{Q} = \left\{ \frac{p}{q} : p, q \in \mathbb{Z}, q \neq 0 \right\}\)

A real number is a value of a continuous quantity that can represent a distance along a line. The real numbers include all the rational numbers \(\mathbb{Q}\), and all the irrational numbers. Thus, Real Numbers \(\mathbb{R} = \mathbb{Q} \cup \{\sqrt{2}, \sqrt{3}, \ldots\} \cup \{ \pi, e, \phi, \ldots \}\)

The complex numbers \(\mathbb{C}\) contain numbers which are expressed in the form \(a + ib\), where \({a}\) and \({b}\) are real numbers. These have two components: the real part \({a}\) and the imaginary part \({b}\), where the imaginary unit \({i}\) satisfies the equation \(i^2 = -1\).
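R supports complex numbers natively; a minimal check of the components and of \(i^2 = -1\):

```r
# #Complex literal: the imaginary unit is written 1i
z <- 3 + 4i
Re(z)
## [1] 3
Im(z)
## [1] 4
Mod(z)
## [1] 5
# #The imaginary unit satisfies i^2 = -1
(1i)^2
## [1] -1+0i
```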

Pi

The number Pi \(\pi = 3.14159\ldots\) is defined as the ratio of the circumference of a circle to its diameter.

\[\pi = \int _{-1}^{1} \frac{dx}{\sqrt{1- x^2}} \tag{18.1}\]

\[e^{i\varphi}=\cos \varphi + i\sin \varphi \tag{18.2}\]

\[e^{i\pi} + 1 = 0 \tag{18.3}\]
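Equation (18.1) can be checked numerically with base R’s integrate(), and Euler’s identity (18.3) with complex exponentiation:

```r
# #Numerical check of the integral definition (18.1)
integrate(function(x) 1 / sqrt(1 - x^2), lower = -1, upper = 1)$value
# #Built-in constant for comparison
pi
## [1] 3.141593
# #Euler's identity (18.3) holds up to floating-point error
Mod(exp(1i * pi) + 1) < 1e-12
## [1] TRUE
```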

# #Read OIS File for 20000 PI digits including integral (3) and fractional (14159...)
# #md5sum = "daf0b33a67fd842a905bb577957a9c7f"
tbl <- read_delim(file = paste0(.z$XL, "PI-OIS-b000796.txt"), 
  delim = " ", col_names = c("POS", "VAL"), col_types = list(POS = "i", VAL = "i"))
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxPI <- tbl
f_setRDS(xxPI)

e

Euler Number \(e = 2.71828\ldots\), is the base of the natural logarithm.

\[e = \lim_{n \to \infty} \left(1 + \frac{1}{n} \right)^{n} = \sum \limits_{n=0}^{\infty} \frac{1}{n!} \tag{18.4}\]
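Both forms of (18.4) can be checked in base R, where exp(1) gives \(e\) directly:

```r
# #Limit form: (1 + 1/n)^n approaches e as n grows
(1 + 1 / 1e6)^1e6
# #Series form: the partial sum of 1/n! converges very quickly
sum(1 / factorial(0:17))
## [1] 2.718282
exp(1)
## [1] 2.718282
```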

Phi

Two quantities are in the golden ratio \(\varphi = 1.618\ldots\) if their ratio is the same as the ratio of their sum to the larger of the two quantities.

\[\varphi^2 - \varphi - 1 = 0 \\ \varphi = \frac{1 + \sqrt{5}}{2} \tag{18.5}\]
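A numerical check of (18.5):

```r
# #The golden ratio from the closed form
phi <- (1 + sqrt(5)) / 2
phi
## [1] 1.618034
# #phi solves x^2 - x - 1 = 0, up to floating-point error
abs(phi^2 - phi - 1) < 1e-12
## [1] TRUE
```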

Groups

Definition 18.2 A prime number is a natural number greater than 1 that is not a product of two smaller natural numbers. A natural number greater than 1 that is not prime is called a ‘composite number.’ 1 is neither a Prime nor a composite, it is a ‘Unit.’ Thus, by definition, Negative Integers and Zero cannot be Prime.
Definition 18.3 Parity is the property of an integer \(\mathbb{Z}\) of whether it is even or odd. It is even if the integer is divisible by 2 with no remainders left and it is odd otherwise. Thus, -2, 0, +2 are even but -1, 1 are odd. Numbers ending with 0, 2, 4, 6, 8 are even. Numbers ending with 1, 3, 5, 7, 9 are odd.
Definition 18.4 An integer \(\mathbb{Z}\) is positive if it is greater than zero, and negative if it is less than zero. Zero is defined as neither negative nor positive.
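Definitions 18.3 and 18.4 translate directly to R’s %% and sign():

```r
aa <- c(-2L, -1L, 0L, 1L, 2L)
# #Parity: even iff divisible by 2 with no remainder
aa %% 2L == 0L
# #Sign: zero is neither positive nor negative
sign(aa)
## [1] -1 -1  0  1  1
```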
Definition 18.5 Mersenne primes are those prime number that are of the form \((2^n -1)\); that is, \(\{3, 7, 31, 127, \ldots \}\)

Mersenne primes:

  • \(\{3, 7, 31, 127, 8191, 131071, 524287, 2147483647, 2305843009213693951, \ldots \}\)
  • \(\{3 (2^{nd}), 7 (4^{th}), 31 (11^{th}), 127 (31^{st}), 8191 (1028^{th}), 131071 (12,251^{st}), 524287 (43,390^{th}), \ldots \}\)
    • Mersenne primes with their position in List of Primes
  • \(2147483647 = (2^{31} - 1)\)
    • It is 105,097,565\(^{th}\) Prime, \(8^{th}\) Mersenne prime and is one of only four known double Mersenne primes.
    • It represents the largest value that a signed 32-bit integer field can hold.
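A quick check of Definition 18.5 for the first few exponents; note that a prime exponent does not guarantee a Mersenne prime:

```r
# #2^n - 1 for n = 2, 3, 5, 7 gives the first four Mersenne primes
2^c(2, 3, 5, 7) - 1
## [1]   3   7  31 127
# #Counterexample: n = 11 is prime but 2^11 - 1 = 2047 = 23 * 89 is composite
2^11 - 1
## [1] 2047
```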

18.4 Primes

Empty Vector

# #Create empty Vector with NA
aa <- rep(NA_integer_, 10)
print(aa)
##  [1] NA NA NA NA NA NA NA NA NA NA

f_isPrime()

f_isPrime <- function(x) {
  # #Check if the number is Prime
  if(!is.integer(x)) {
    cat("Error! Integer required. \n")
    stop()
  } else if(x <= 0L) {
    cat("Error! Positive Integer required. \n")
    stop()
  } else if(x > 2147483647L) {
    cat(paste0("Doubles are stored as approximation. Prime will not be calculated for value higher than '2147483647' \n"))
    stop()
  }
  # #However, this checks the number against ALL Smaller values including non-primes
  if(x == 2L || all(x %% 2L:ceiling(sqrt(x)) != 0)) {
    # # "seq.int(3, ceiling(sqrt(x)), 2)" is slower
    return(TRUE)
  } else {
    ## (any(x %% 2L:ceiling(sqrt(x)) == 0))
    ## (any(x %% seq.int(3, ceiling(sqrt(x)), 2) == 0))
    ## NOTE Further, if sequence starts from 3, add 2 also as a Prime Number
    return(FALSE)
  }
}
# #Vectorise Version
f_isPrimeV <- Vectorize(f_isPrime)
# #Compiled Version
f_isPrimeC <- cmpfun(f_isPrime)

Primes

# #There are 4 Primes in First 10, 25 in 100, 168 in 1000, 1229 in 10000.
# # Using Vectorise Version, get all the Primes
aa <- 1:10
bb <- aa[f_isPrimeV(aa)]
ii <- f_getPrimeUpto(10)
stopifnot(identical(bb, ii))
# #
xxPrime10 <- c(2, 3, 5, 7) |> as.integer()
# #
xxPrime100 <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 
               53, 59, 61, 67, 71, 73, 79, 83, 89, 97)  |> as.integer()
#
# #Generate List of ALL Primes till 524287 (i.e. Total 43,390 Primes)
xxPrimes <- f_getPrimeUpto(524287L)
# #Save as RDS
f_setRDS(xxPrimes)

Large Integers

# #NOTE: Assigning 2147483647L causes the Chunk to throw Warnings even with 'eval=FALSE'.
if(FALSE){
# #Assignment of 2305843009213693951L is NOT possible without Warning
# #Even within non-executing Block or with 'eval=FALSE' or suppressWarnings() or tryCatch()
# #It cannot be stored as integer, thus it is automatically converted to double
  #bb <- 2305843009213693951L
# #Warning: non-integer value 2305843009213693951L qualified with L; using numeric value 
# #NOTE that the value changed. It is explicitly NOT a prime anymore.
  #print(bb, digits = 20)
# #[1] 2305843009213693952
#
# #Assignment of 2147483647L is possible and direct printing in console works BUT
# #Its printing will also throw Warnings that are difficult to handle
# #Avoid Printing. Even within non-executing Block, it is affecting R Bookdown.
  aa <- 2147483647L
  #print(aa)
}

f_getPrime()

f_getPrimeUpto <- function(x){
  # #Get a Vector of Primes upto the given Number (Max. 524287)
  if(x < 2) {
    print("NOT ALLOWED!")
    return(NULL)
  } else if(x > 524287){
    print("Sadly, beyond this number it is difficult to generate the List of Primes!")
    return(NULL)
  }
  y <- 2:x
  i <- 1
  while (y[i] <= sqrt(x)) {
    y <-  y[y %% y[i] != 0 | y == y[i]]
    i <- i+1
  }
  return(y)
}

Benchmark

# #Compare any number of functions
result <- microbenchmark(
  sum(1:100)/length(1:100), 
  mean(1:100),
  #times = 1000,
  check = 'identical'
)
# #Print Table
print(result)
##Unit: microseconds
##                     expr min    lq    mean median     uq    max neval cld
## sum(1:100)/length(1:100) 1.2 1.301 1.54795 1.5005 1.6005  7.501   100  a 
##              mean(1:100) 5.9 6.001 6.56989 6.1010 6.2010 28.001   100   b
#
# #Boxplot of Benchmarking Result
#autoplot(result)
# #Above testcase showed a surprising result of sum()/length() being much faster than mean()
#
# #Or Compare Plot Rendering
if(FALSE) microbenchmark(print(jj), print(kk), print(ll), times = 2)

Sum-Mean

“ForLater” - Include rowsum(), rowSums(), colSums(), rowMeans(), colMeans() in this also.

# #Conclusion: use mean() because precision is difficult to achieve compared to speed
#
# #sum()/length() is faster than mean()
# #However, mean() does double pass, so it would be more accurate
# #mean.default() and var() compute means with an additional pass and so are more accurate
# #e.g. the variance of a constant vector is (almost) always zero 
# #and the mean of such a vector will be equal to the constant value to machine precision.
aa <- 1:100
#
microbenchmark(
  sum(aa)/length(aa), 
  mean(aa),
  mean.default(aa),
  .Internal(mean(aa)),
  #times = 1000,
  check = 'identical'
)
## Unit: nanoseconds
##                 expr  min   lq mean median   uq   max neval cld
##   sum(aa)/length(aa)  400  400  500    500  500  2200   100 a  
##             mean(aa) 3900 4000 4299   4100 4200 19300   100   c
##     mean.default(aa) 1400 1500 1555   1500 1600  5400   100  b 
##  .Internal(mean(aa))  500  500  572    500  600  2900   100 a
# #rnorm() generates random deviates of given length
set.seed(3)
aa <- rnorm(1e7)
str(aa)
##  num [1:10000000] -0.962 -0.293 0.259 -1.152 0.196 ...
#
# #NOTE manual calculation and mean() is NOT matching
identical(sum(aa)/length(aa), mean(aa))
## [1] FALSE
#
# #There is a slight difference
sum(aa)/length(aa) - mean(aa)
## [1] 2.355429e-17

Remove Objects

if(FALSE) {
  # #Remove all objects matching a pattern
  rm(list = ls(pattern = "f_"))
}

Options Memory

# #Check the Current Options Value
getOption("expressions")
## [1] 5000
if(FALSE) {
  # #Change Value
  # #NOTE it did not help when recursive function failed
  # #Error: node stack overflow
  # #Error during wrapup: node stack overflow
  # #Error: no more error handlers available ...
  options(expressions=10000)
}

Vectorize()

# #To Vectorise a Function
f_isPrimeV <- Vectorize(f_isPrime)

Compiling

# #To Pre-Compile a Function for faster performance
f_isPrimeC <- cmpfun(f_isPrime)

Profiling

# #To Profile a Function Calls for improvements
Rprof("file.out")
f_isPrime(2147483647L)
#f_getPrimesUpto(131071L)
Rprof(NULL)
summaryRprof("file.out")

Legacy A

# #Functions to check for PRIME - All of them have various problems
# #"-3L -2L -1L 0L 1L 8L" FALSE "2L 3L ... 524287L 2147483647L" TRUE
isPrime_a <- function(x) {
  # #Fails for "2147483647L" Error: cannot allocate vector of size 8.0 Gb
  if (x == 2L) {
    return(TRUE)
  } else if (any(x %% 2:(x-1) == 0)) {
    return(FALSE)
  } else return(TRUE)
}

isPrime_b <- function(x){
  # #Comparison of Division and Integer Division by 1, 2, ..., x
  # #Fails for "2147483647L" Error: cannot allocate vector of size 16.0 Gb
  # #Fails for "-ve and zero" Error: missing value where TRUE/FALSE needed
  # vapply(x, function(y) sum(y / 1:y == y %/% 1:y), integer(1L)) == 2L
  if(sum(x / 1:x == x %/% 1:x) == 2) {
    return(TRUE) 
  } else return(FALSE)
}

isPrime_c <- function(x) {
  # #RegEx Slowest: It converts -ve values and coerces non-integers, which may result in bugs
  x <- abs(as.integer(x))
  if(x > 8191L) {
    print("Do not run this with large values. RegEx is really slow.")
    stop()
  }
  !grepl('^1?$|^(11+?)\\1+$', strrep('1', x))
}

isPrime_d <- function(x) {
  # #Fails for "1" & returns TRUE
  # #Fails for "-ve and zero" Error: NA/NaN argument
  if(x == 2L || all(x %% 2L:max(2, floor(sqrt(x))) != 0)) {
    return(TRUE)
  } else return(FALSE)
}

isPrime_e <- function(x) {
  # #Fails for "-ve and zero" Error: NA/NaN argument
  # #This is the most robust which can be improved by conditional check for positive integers
  # #However, this checks the number against ALL Smaller values including non-primes
  if(x == 2L || all(x %% 2L:ceiling(sqrt(x)) != 0)) {
    # # "seq.int(3, ceiling(sqrt(x)), 2)" is slower
    return(TRUE)
  } else {
    ## (any(x %% 2L:ceiling(sqrt(x)) == 0))
    ## (any(x %% seq.int(3, ceiling(sqrt(x)), 2) == 0))
    ## NOTE Further, if sequence starts from 3, add 2 also as a Prime Number
    return(FALSE)
  }
}

Legacy B

# #131071 (12,251th), 524287 (43,390th), 2147483647 (105,097,565th)
aa <- 1:131071
# #Following works but only till 524287L, Memory Overflow ERROR for 2147483647L
bb <- aa[f_isPrimeV(aa)]

getPrimeUpto_a <- function(x){
  # #Extremely slow, cannot go beyond 8191L in benchmark testing
  if(x < 2) return("ERROR")
  y <- 2:x
  primes <- rep(2L, x)
  j <- 1L
  for (i in y) {
    if (!any(i %% primes == 0)) {
      j <- j + 1L
      primes[j] <- i
      #cat(paste0("i=", i, ", j=", j, ", Primes= ", paste0(head(primes, j), collapse = ", ")))
    }
    #cat("\n")
  }
  result <- head(primes, j)
  #str(result)
  #cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
  #cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
  return(result)
}

getPrimeUpto_b <- function(x){
# #https://stackoverflow.com/questions/3789968/
  # #This is much faster even from the "aa[f_isPrimeV(aa)]"
    if(x < 2) return("ERROR")
    y <- 2:x
    i <- 1
    while (y[i] <= sqrt(x)) {
        y <-  y[y %% y[i] != 0 | y == y[i]]
        i <- i+1
    }
    result <- y
    #str(result)
    #cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
    #cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
    return(result)
}

getPrimeUpto_c <- function(x) {
  # #Problems and Slow
  # #Returns a Vector of Primes till the Number i.e. f_getPrimesUpto(7) = (2, 3, 5, 7)
  # #NOTE: f_getPrimesUpto(1) and f_getPrimesUpto(2) both return "2"
  if(!is.integer(x)) {
    cat("Error! Integer required. \n")
    stop()
  } else if(!identical(1L, length(x))) {
    cat("Error! Unit length vector required. \n")
    stop()
  } else if(x <= 0L) {
    cat("Error! Positive Integer required. \n")
    stop()
  } else if(x > 2147483647) {
    cat(paste0("Doubles are stored as approximation. Prime will not be calculated for value higher than '2147483647' \n"))
    stop()
  }
  
  # #Cannot create vector of length 2147483647L and also not needed that many
  # #ceiling(sqrt(7L)) return 3, however we need length 4 (2, 3, 5, 7)
  # #So, added extra 10
  #primes <- rep(NA_integer_, 10L + sqrt(2L))
  primes <- rep(2L, 10L + sqrt(2L))
  j <- 1L
  primes[j] <- 2L
  #
  i <- 2L
  while(i <= x) {
    # #na.omit() was the slowest step, so changed all NA to 2L in the primes
    #k <- na.omit(primes[primes <= ceiling(sqrt(i))])
    k <- primes[primes <= ceiling(sqrt(i))]
    if(all(as.logical(i %% k))) {
      j <- j + 1
      primes[j] <- i
    }  
    # #Increment with INTEGER Addition
    i = i + 1L
  }
  result <- primes[complete.cases(primes)]
  str(result)
  cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
  cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
  return(result)
}

getPrimeUpto_d <- function(n = 10L, i = 2L, primes = c(2L), bypass = TRUE){
  # #Using Recursion is NOT a good solution
  # #Function to return N Primes upto 1000 Primes (7919) or Max Value reaching 10000.
  if(i > 10000){
    cat("Reached 10000 \n")
    return(primes)
  }
  if(bypass) {
    maxN <- 1000L
    if(!is.integer(n)) {
      cat("Error! Integer required. \n")
      stop()
    } else if(!identical(1L, length(n))) {
      cat("Error! Unit length vector required. \n")
      stop()
    } else if(n <= 0L) {
      cat("Error! Positive Integer required. \n")
      stop()
    } else if(n > maxN) {
      cat(paste0("Error! This will calculate only up to ", maxN, " prime Numbers. \n"))
      stop()
    }
  }
  if(length(primes) < n) {
    if(all(as.logical(i %% primes[primes <= ceiling(sqrt(i))]))) {
      # #Coercing 0 to FALSE, Non-zero Values to TRUE
      # # "i %% 2L:ceiling(sqrt(i))" checks i against all integers till sqrt(i)
      # # "primes[primes <= ceiling(sqrt(i))]" checks i against only the primes till sqrt(i)
      # #However, the above needs hardcoded 2L as prime so the vector is never empty
      # #Current Number is Prime, so include it in the vector and check the successive one
      getPrimeUpto_d(n, i = i + 1L, primes = c(primes, i), bypass = FALSE)
    } else {
      # #Current Number is NOT Prime, so check the successive one
      getPrimeUpto_d(n, i = i + 1L, primes = primes, bypass = FALSE)
    }
  } else {
    # #Return the vector when it reaches the count
    return(primes)
  }
}

18.5 Measures of Location

Definition 18.6 Measures of location are numerical summaries that indicate where on a number line a certain characteristic of the variable lies. Examples of the measures of location are percentiles and quantiles.
Definition 18.7 The measures of center are a special case of measures of location. These estimate where the center of a particular variable lies. Most common are Mean, Median, and Mode.

18.5.1 Mean

Definition 18.8 Given a data set \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({{x}_1, {x}_2, \ldots, {x}_n}\) divided by the count \({n}\).
  • Refer equation (18.6)
    • Sample mean is denoted by \({\overline{x}}\) (x bar) and Population mean is denoted by \({\mu}\).
    • Mean is the most commonly used measure of central location, even though it is influenced by extreme values.

\[\overline{x} = \frac{1}{n}\left (\sum_{i = 1}^n{{x}_i}\right ) = \frac{{x}_1 + {x}_2 + \cdots + {x}_n}{n} \tag{18.6}\]

In the mean calculation, normally each \({{x}_i}\) is given equal importance or weightage of \({1/n}\). However, in some instances the mean is computed by giving each observation a weight that reflects its relative importance. A mean computed in this manner is referred to as the weighted mean, as given in equation (18.7)

\[\overline{x} = \frac{\sum_{i=1}^n{w_ix_i}}{\sum_{i=1}^n{w_i}} \tag{18.7}\]

Caution: The unit of the mean is the same as the unit of the variable, e.g. for cost_per_kg the weight ‘w’ would be ‘kg.’

Mean

aa <- 1:10
# #Mean of First 10 Numbers
mean(aa)
## [1] 5.5

More

aa <- 1:10
# #Mean of First 10 Numbers
ii <- mean(aa)
print(ii)
## [1] 5.5
jj <- sum(aa)/length(aa)
stopifnot(identical(ii, jj))
#
# #Mean of First 10 Prime Numbers (is neither Prime nor Integer)
mean(f_getRDS(xxPrimes)[1:10])
## [1] 12.9
#
# #Mean of First 100 Digits of PI
f_getRDS(xxPI)[1:100, ] %>% pull(VAL) %>% mean()
## [1] 4.71

Weighted Mean

aa <- tibble(Purchase = 1:5, cost_per_kg = c(3, 3.4, 2.8, 2.9, 3.25), 
             kg = c(1200, 500, 2750, 1000, 800))
# #NOTE that unit of mean is same as unit of the variable e.g. cost_per_kg thus 'w' would be 'kg'
(ii <- sum(aa$cost_per_kg * aa$kg)/sum(aa$kg))
## [1] 2.96
jj <- with(aa, sum(cost_per_kg * kg)/sum(kg))
kk <- weighted.mean(x = aa$cost_per_kg, w = aa$kg)
stopifnot(all(identical(ii, jj), identical(ii, kk)))

18.5.2 Median

Definition 18.9 Median of a population is any value such that at most half of the population is less than the proposed median and at most half is greater than the proposed median.
  • Refer equation (18.8)
    • The median is the value in the middle when the data is sorted
    • For an odd number of observations, the median is the middle value.
    • For an even number of observations, the median is the average of the two middle values.
    • Although the mean is the more commonly used measure of central location, whenever a data set contains extreme values, the median is preferred.
      • The mean and median are different concepts and answer different questions.
        • Ex: Income is nearly always reported as median, but if we are looking at the ‘spending power of the whole community’ it may not be right.
    • The median is well-defined for any ordered data, and is independent of any distance metric.
      • The median can thus be applied to classes which are ranked but not numerical (ordinal), although the result might be halfway between classes if there is an even number of cases.

\[\begin{align} \text{if n is odd, } median(x) & = x_{(n + 1)/2} \\ \text{if n is even, } median(x) & = \frac{x_{(n/2)} + x_{(n/2) + 1}}{2} \end{align} \tag{18.8}\]

Median

aa <- 1:10 
# #Median of First 10 Numbers
median(aa)
## [1] 5.5

More

aa <- 1:10 
# #Median of First 10 Numbers
median(aa)
## [1] 5.5
#
# #Median of First 10 Prime Numbers (is NOT prime)
median(f_getRDS(xxPrimes)[1:10])
## [1] 12
#
# #Median of First 100 Digits of PI
f_getRDS(xxPI)[1:100, ] %>% pull(VAL) %>% median()
## [1] 4.5

18.5.3 Geometric Mean

Definition 18.10 The geometric mean \(\overline{x}_g\) is a measure of location that is calculated by finding the \(n^{th}\) root of the product of \({n}\) values.
  • Refer equation (18.9)
    • The geometric mean applies only to positive numbers
    • The geometric mean is often used for a set of numbers whose values are meant to be multiplied together or are exponential in nature
    • For all positive data sets containing at least one pair of unequal values, the harmonic mean is always the least of the three means, while the arithmetic mean is always the greatest of the three and the geometric mean is always in between.

\[\overline{x}_g = \left(\prod _{i=1}^{n} x_i\right)^{\frac{1}{n}} = \sqrt[{n}]{x_1 x_2 \ldots x_n} \tag{18.9}\]

Geometric Mean

aa <- 1:10
# #Geometric Mean of First 10 Numbers
exp(mean(log(aa)))
## [1] 4.528729

More

aa <- 1:10
# #Geometric Mean of First 10 Numbers
ii <- exp(mean(log(aa)))
jj <- prod(aa)^(1/length(aa))
stopifnot(identical(ii, jj))
#
# #Geometric Mean of First 10 Prime Numbers 
exp(mean(log(f_getRDS(xxPrimes)[1:10])))
## [1] 9.573889

18.5.4 Mode

Definition 18.11 The mode is the value that occurs with greatest frequency.
  • The median makes sense when there is a linear order on the possible values. Unlike median, the concept of mode makes sense for any random variable assuming values from a vector space.

Mode

# #Mode of First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ] %>% pull(VAL)
f_getMode(bb)
## [1] 9

More

# #Mode of First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #Get Frequency
bb %>% count(VAL)
## # A tibble: 10 x 2
##      VAL     n
##    <int> <int>
##  1     0     8
##  2     1     8
##  3     2    12
##  4     3    12
##  5     4    10
##  6     5     8
##  7     6     9
##  8     7     8
##  9     8    12
## 10     9    13
#
# #Get Mode
bb %>% pull(VAL) %>% f_getMode()
## [1] 9

f_getMode()

f_getMode <- function(x) {
  # #Calculate Statistical Mode
  # #NOTE: Single Length, All NA, Characters etc. have NOT been validated
  # #https://stackoverflow.com/questions/56552709
  # #https://stackoverflow.com/questions/2547402
  # #Remove NA
  if (anyNA(x)) {
    x <- x[!is.na(x)]
  }
  # #Get Unique Values
  ux <- unique(x)
  # #Match
  ux[which.max(tabulate(match(x, ux)))]
}

18.5.5 Percentiles

Definition 18.12 A percentile provides information about how the data are spread over the interval from the smallest value to the largest value. For a data set containing \({n}\) observations, the \(p^{th}\) percentile divides the data into two parts: approximately p% of the observations are less than the \(p^{th}\) percentile, and approximately (100 – p)% of the observations are greater than the \(p^{th}\) percentile.
  • Refer equation (18.10)
    • Percentile is the value which divides the data into two groups when it is sorted
    • Quartiles are specific percentiles of 25%, 50% and 75%
    • Median is 50% percentile
    • Caution: Excel “PERCENTILE.EXC” calculations match the type = 6 option of quantile(); the default is type = 7

\[L_p = \frac{p}{100}(n + 1) \tag{18.10}\]

Percentiles

# #First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #50% Percentile of Digits i.e. Median
quantile(bb$VAL, 0.5)
## 50% 
## 4.5

More

# #First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #50% Percentile of Digits i.e. Median
ii <- quantile(bb$VAL, 0.5)
print(ii)
## 50% 
## 4.5
jj <- median(bb$VAL)
stopifnot(identical(unname(ii), jj))
# 
# #All Quartiles
quantile(bb$VAL, seq(0, 1, 0.25))
##   0%  25%  50%  75% 100% 
## 0.00 2.00 4.50 7.25 9.00
# #summary()
summary(bb$VAL)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.00    4.50    4.71    7.25    9.00
#
# #To Match with Excel "PERCENTILE.EXC" use type=6 in place of default type=7
quantile(bb$VAL, seq(0, 1, 0.25), type = 6)
##   0%  25%  50%  75% 100% 
## 0.00 2.00 4.50 7.75 9.00

18.6 Measures of Variability

Definition 18.13 Measures of spread (or the measures of variability) describe how spread out the data values are. Examples are Range, SD, mean absolute deviation, and IQR.

In addition to measures of location, it is often desirable to consider measures of variability, or dispersion.

  • Range range()
    • (Largest value - Smallest value) i.e. max() - min()
    • Range is based on only two of the observations and thus is highly influenced by extreme values.
  • Interquartile Range (IQR) IQR()
    • The difference between the third quartile \((Q3, 75\%)\), and the first quartile \((Q1, 25\%)\)
    • IQR is a measure of variability, much more robust than the SD. IQR is less sensitive to the presence of the outliers.
    • It overcomes the dependency on extreme values
    • \(x_i \notin [Q_1 - 1.5 * \text{IQR}, Q_3 + 1.5 * \text{IQR}] \to x_i \in \text{Outlier}\)
    • It is assumed that any data point not in [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] is an outlier
      • The 1.5 multiplier gives us a window of \({\mu} \pm 2.7 {\sigma} \approx {\mu} \pm 3 {\sigma}\), which is quite close to the Normal plot limit.
  • Mean Absolute Error (MAE)
    • \(\text{MAE} = \frac{\sum |x_i - \overline{x}|}{n}\)
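
As a quick base R illustration of these measures (the data vector `aa` below is hypothetical):

```r
# #Measures of Variability on a small (hypothetical) vector
aa <- c(2, 4, 4, 4, 5, 5, 7, 9)
# #Range: Largest value - Smallest value
rng <- max(aa) - min(aa)          # same as diff(range(aa))
# #IQR: Q3 - Q1
iqr <- IQR(aa)
# #Outlier fences: [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
fence <- unname(quantile(aa, c(0.25, 0.75))) + c(-1.5, 1.5) * iqr
out <- aa[aa < fence[1] | aa > fence[2]]
# #Mean Absolute Error about the mean
mae <- sum(abs(aa - mean(aa))) / length(aa)
stopifnot(rng == 7, iqr == 1.5)
```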

18.6.1 Variance

Definition 18.14 The variance \(({\sigma}^2)\) is based on the difference between the value of each observation \({x_i}\) and the mean \({\overline{x}}\). The average of the squared deviations is called the variance.
  • Refer equation (18.11)
    • Sample Variance is denoted by \(s^2\) and Population Variance is denoted by \(\sigma^2\)
    • The variance is a measure of variability that utilizes all the data.
    • The difference between each \({x_i}\) and the mean (\(\overline{x}, \mu\)) is called a deviation about the mean i.e. (\(x_i - \overline{x}\)). Sum of deviation about the mean is always zero i.e. \(\sum (x_i - \overline{x}) =0\)
    • In the computation of the variance, the deviations about the mean are squared.
      • Because of the squaring involved, it is sensitive to the presence of outliers, leading analysts to prefer other measures of spread, such as the mean absolute deviation, in situations involving extreme values.

\[\begin{align} \sigma^2 &= \frac{1}{n} \sum _{i=1}^{n} \left(x_i - \mu \right)^2 \\ s^2 &= \frac{1}{n-1} \sum _{i=1}^{n} \left(x_i - \overline{x} \right)^2 \end{align} \tag{18.11}\]
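
Equation (18.11) can be verified in base R; note that var() uses the sample divisor (n - 1), so the population variance needs a manual divisor (a minimal sketch):

```r
# #Sample vs. Population Variance for aa <- 1:10
aa <- 1:10
n <- length(aa)
# #Sum of squared deviations about the mean
ss <- sum((aa - mean(aa))^2)
# #Sample variance s^2 uses divisor (n - 1); this is what var() computes
s2 <- ss / (n - 1)
stopifnot(isTRUE(all.equal(s2, var(aa))))
# #Population variance sigma^2 uses divisor n
sigma2 <- ss / n
```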

18.6.2 Standard Deviation

Definition 18.15 The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.
  • Refer equation (18.12)
    • Standard deviation for sample is denoted by \({s}\) and for Population by \({\sigma}\)
    • It can be interpreted as the “typical” distance between a field value and the mean
    • The coefficient of variation is a relative measure of variability. It measures the standard deviation relative to the mean. It is given in percentage as \(100 \times \sigma / \mu\)

\[\begin{align} \sigma &= \sqrt{\frac{1}{N} \sum_{i=1}^N \left(x_i - \mu\right)^2} \\ {s} &= \sqrt{\frac{1}{N-1} \sum_{i=1}^N \left(x_i - \overline{x}\right)^2} \end{align} \tag{18.12}\]
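
A minimal base R sketch of the definition, reusing aa <- 1:10 from the earlier examples:

```r
# #SD is the positive square root of the sample variance
aa <- 1:10
stopifnot(isTRUE(all.equal(sd(aa), sqrt(var(aa)))))
# #Coefficient of Variation (relative variability, in percent): 100 * sd / mean
cv <- 100 * sd(aa) / mean(aa)
round(cv, 2)
## [1] 55.05
```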

18.7 Measures of Distribution Shape

18.7.1 Skewness

Definition 18.16 Skewness \((\tilde{\mu}_{3})\) is a measure of the shape of a data distribution. It is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
Definition 18.17 A tail refers to the tapering sides at either end of a distribution curve.
  • Data skewed to the left result in negative skewness; a symmetric data distribution results in zero skewness; and data skewed to the right result in positive skewness.
  • \(\tilde{\mu}_{3}\) is the \(3^{rd}\) standardized moment
    • Side topic: A standardized moment of a probability distribution is a moment (normally a higher degree central moment) that is normalized. The normalization is typically a division by an expression of the standard deviation which renders the moment scale invariant.
      • \(\tilde{\mu}_{1} = 0\), because the first moment about the mean is always zero.
      • \(\tilde{\mu}_{2} = 1\), because the second moment about the mean is equal to the variance \({\sigma}^2\).
      • \(\tilde{\mu}_{3}\) is a measure of skewness
      • \(\tilde{\mu}_{4}\) refers to the Kurtosis

Refer figure 18.1

  • No skew: (symmetric)
    • A unimodal distribution with zero skewness is not necessarily symmetric. However, a symmetric unimodal or multimodal distribution always has zero skewness. The normal distribution has a skewness of zero, but the reverse may not be true.
  • Negative skew: (left-skewed, left-tailed, or skewed to the left)
    • The left tail is longer, thus the ‘left’ refers to the left tail being drawn out
    • The curve itself appears to be leaning to the right i.e. the mass of the distribution is concentrated on the right of the figure
  • Positive skew: (right-skewed, right-tailed, or skewed to the right)
    • The right tail is longer, thus the ‘right’ refers to the right tail being drawn out
    • The curve itself appears to be leaning to the left i.e. the mass of the distribution is concentrated on the left of the figure
  • Relationship of mean and median
    • The skewness is not directly related to the relationship between the mean and median: a distribution with negative skew can have its mean greater than or less than the median, and likewise for positive skew
    • However, generally the skew can be calculated as \(({\mu} -{\nu})/\sigma\), where \({\nu}\) is median
  • Application:
    • Skewness indicates the direction and relative magnitude of deviation from the normal distribution.
    • It indicates the direction of outliers
    • With pronounced skewness, standard statistical inference procedures such as a confidence interval for a mean will be not only incorrect, in the sense that the true coverage level will differ from the nominal (e.g., 95%) level, but they will also result in unequal error probabilities on each side.

Skewness is given by the equation (18.13), which is being shown here because it has deep meaning.

\[Skew = \frac{\tfrac {1}{n}\sum_{i=1}^{n}(x_{i}-{\overline{x}})^{3}}{\left[\tfrac {1}{n-1}\sum_{i=1}^{n}(x_{i}-{\overline{x}})^{2} \right]^{3/2}} \tag{18.13}\]

Charts


Figure 18.1 (Left Tail, Negative) Beta, Normal Distribution, Exponential (Positive, Right Tail)

skewness()

# #Skewness Calculation: Package "e1071" (Package "moments" deprecated)
even_skew <- c(49, 50, 51)
pos_skew <- c(even_skew, 60)
neg_skew <- c(even_skew, 40)
skew_lst <- list(even_skew, pos_skew, neg_skew)
# #Mean, Median, SD
cat(paste0("Mean (even, pos, neg): ", 
           paste0(vapply(skew_lst, mean, numeric(1)), collapse = ", "), "\n"))
## Mean (even, pos, neg): 50, 52.5, 47.5
cat(paste0("Median (even, pos, neg): ", 
           paste0(vapply(skew_lst, median, numeric(1)), collapse = ", "), "\n"))
## Median (even, pos, neg): 50, 50.5, 49.5
cat(paste0("SD (even, pos, neg): ", paste0(
           round(vapply(skew_lst, sd, numeric(1)), 1), collapse = ", "), "\n"))
## SD (even, pos, neg): 1, 5.1, 5.1
#
cat(paste0("Skewness (even, pos, neg): ", paste0(
           round(vapply(skew_lst, e1071::skewness, numeric(1)), 1), collapse = ", "), "\n"))
## Skewness (even, pos, neg): 0, 0.7, -0.7
cat(paste0("Kurtosis (even, pos, neg): ", paste0(
           round(vapply(skew_lst, e1071::kurtosis, numeric(1)), 1), collapse = ", "), "\n"))
## Kurtosis (even, pos, neg): -2.3, -1.7, -1.7

Normal Exp Beta

# #Skewness Calculation: Package "e1071" (Package "moments" deprecated)
dis_lst <- list(xxNormal, xxExp, xxBeta)
#
# #Skewness: Normal has value close to 0 (symmetric)
# #Skewness "e1071" has Type = 3 as default. Its Type = 1 matches "moments"
# #Practically, Normal has (small) NON-Zero Positive Skewness
skew_e_t3 <- vapply(dis_lst, e1071::skewness, numeric(1))
skew_e_t2 <- vapply(dis_lst, e1071::skewness, type = 2, numeric(1))
skew_e_t1 <- vapply(dis_lst, e1071::skewness, type = 1, numeric(1))
skew_mmt <-  vapply(dis_lst, moments::skewness, numeric(1))
stopifnot(identical(round(skew_e_t1, 10), round(skew_mmt, 10)))
cat(paste0("e1071: Type = 1 Skewness (Normal, Exp, Beta): ", 
           paste0(round(skew_e_t1, 4), collapse = ", "), "\n"))
## e1071: Type = 1 Skewness (Normal, Exp, Beta): 0.0407, 2.0573, -0.6279
cat(paste0("e1071: Type = 2 Skewness (Normal, Exp, Beta): ", 
           paste0(round(skew_e_t2, 4), collapse = ", "), "\n"))
## e1071: Type = 2 Skewness (Normal, Exp, Beta): 0.0407, 2.0576, -0.628
cat(paste0("e1071: Type = 3 Skewness (Normal, Exp, Beta): ", 
           paste0(round(skew_e_t3, 4), collapse = ", "), "\n"))
## e1071: Type = 3 Skewness (Normal, Exp, Beta): 0.0407, 2.057, -0.6278
#
# #Formula: (sigma_ (x_i - mu)^3) /(n * sd^3)
bb <- xxNormal
skew_man <- sum({bb - mean(bb)}^3) / {length(bb) * sd(bb)^3}
cat(paste0("(Manual) Skewness of Normal: ", round(skew_man, 4), 
           " (vs. e1071 Type 3 = ", round(skew_e_t3[1], 4), ") \n"))
## (Manual) Skewness of Normal: 0.0407 (vs. e1071 Type 3 = 0.0407)

Distributions

set.seed(3)
nn <- 10000L
# #Normal distribution is symmetrical
xxNormal <- rnorm(n = nn, mean = 0, sd = 1)
# #The exponential distribution is positively skewed
xxExp <- rexp(n = nn, rate = 1)
# #The beta distribution with hyper-parameters α=5 and β=2 is negatively skewed
xxBeta <- rbeta(n = nn, shape1 = 5, shape2 = 2)
#
# #Save
f_setRDS(xxNormal)
f_setRDS(xxExp)
f_setRDS(xxBeta)
#f_getRDS(xxNormal)
# #Get the Distributions
xxNormal <- f_getRDS(xxNormal)
xxExp <- f_getRDS(xxExp)
xxBeta <- f_getRDS(xxBeta)

Density

# #Density Curve
# #Assumes 'hh' has data in 'ee'. In: cap_hh
#Basics
mean_hh <- mean(hh$ee)
sd_hh <- sd(hh$ee)
#
skew_hh <- skewness(hh$ee)
kurt_hh <- kurtosis(hh$ee)
# #Get Quantiles and Ranges of mean +/- sigma 
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
#
# # Latex Labels 
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'), 
                TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'), 
                TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
C03 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_density(alpha = 0.2, colour = "#21908CFF") + 
  geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), 
        axis.ticks = element_blank(), 
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(), 
        axis.line.y = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank()) + 
  labs(x = "x", y = "Density", 
       subtitle = paste0("Mean = ", round(mean_hh, 3), "; SD = ", round(sd_hh, 3), "; Skewness = ", round(skew_hh, 3), "; Kurtosis = ", round(kurt_hh, 3)), 
        caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C03)
rm(C03)

18.7.2 Kurtosis

Definition 18.18 Kurtosis \((\tilde{\mu}_{4})\) is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution. For \({\mathcal{N}}_{(\mu, \, \sigma)}\), kurtosis is 3 and excess kurtosis is 0 (i.e. subtract 3).

Distributions with zero excess kurtosis are called mesokurtic. The most prominent example of a mesokurtic distribution is the normal distribution. The kurtosis of any univariate normal distribution is 3.

Distributions with kurtosis less than 3 are said to be platykurtic. It means the distribution produces fewer and less extreme outliers than does the normal distribution. An example of a platykurtic distribution is the uniform distribution, which does not produce outliers.

Distributions with kurtosis greater than 3 are said to be leptokurtic. An example of a leptokurtic distribution is the Laplace distribution, which has tails that asymptotically approach zero more slowly than a Gaussian, and therefore produces more outliers than the normal distribution.

Kurtosis is the average (or expected value) of the standardized data raised to the fourth power. Any standardized values that are less than 1 (i.e., data within one standard deviation of the mean, where the “peak” would be), contribute virtually nothing to kurtosis, since raising a number that is less than 1 to the fourth power makes it closer to zero. The only data values that contribute to kurtosis in any meaningful way are those outside the region of the peak; i.e., the outliers. Therefore, kurtosis measures outliers only; it measures nothing about the “peak.”

The sample kurtosis is a useful measure of whether there is a problem with outliers in a data set. Larger kurtosis indicates a more serious outlier problem.

# #Kurtosis Calculation: Package "e1071" (Package "moments" deprecated)
dis_lst <- list(xxNormal, xxExp, xxBeta)
#
# #Kurtosis: Normal has value close to 3 (= 0 excess Kurtosis)
# #Kurtosis "e1071" has Type = 3 as default. Its Type = 1 matches "moments" with difference of 3
kurt_e_t3 <- vapply(dis_lst, e1071::kurtosis, numeric(1))
kurt_e_t2 <- vapply(dis_lst, e1071::kurtosis, type = 2, numeric(1))
kurt_e_t1 <- vapply(dis_lst, e1071::kurtosis, type = 1, numeric(1))
kurt_mmt <-  vapply(dis_lst, moments::kurtosis, numeric(1))
stopifnot(identical(round(kurt_e_t1, 10), round(kurt_mmt - 3, 10)))
cat(paste0("e1071: Type = 1 Kurtosis (Normal, Exp, Beta): ", 
           paste0(round(kurt_e_t1, 4), collapse = ", "), "\n"))
## e1071: Type = 1 Kurtosis (Normal, Exp, Beta): -0.0687, 6.3223, -0.106
cat(paste0("e1071: Type = 2 Kurtosis (Normal, Exp, Beta): ", 
           paste0(round(kurt_e_t2, 4), collapse = ", "), "\n"))
## e1071: Type = 2 Kurtosis (Normal, Exp, Beta): -0.0682, 6.326, -0.1055
cat(paste0("e1071: Type = 3 Kurtosis (Normal, Exp, Beta): ", 
           paste0(round(kurt_e_t3, 4), collapse = ", "), "\n"))
## e1071: Type = 3 Kurtosis (Normal, Exp, Beta): -0.0693, 6.3204, -0.1066
#
# #Formula: (sigma_ (x_i - mu)^4) /(n * sd^4)
bb <- xxNormal
kurt_man <- {sum({bb - mean(bb)}^4) / {length(bb) * sd(bb)^4}} - 3
cat(paste0("(Manual) Kurtosis of Normal: ", round(kurt_man, 4), 
           " (vs. e1071 Type 3 = ", round(kurt_e_t3[1], 4), ") \n"))
## (Manual) Kurtosis of Normal: -0.0693 (vs. e1071 Type 3 = -0.0693)

18.8 Relative Location

18.8.1 z-Scores

Measures of relative location help us determine how far a particular value is from the mean. By using both the mean and standard deviation, we can determine the relative location of any observation.

Definition 18.19 A sample of \({n}\) observations given by \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\) have a sample mean \({\overline{x}}\) and the sample standard deviation, \({s}\).
Definition 18.20 The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.
  • Refer equation (18.14) (Similar to equation (21.4))
    • For example, \(z_1 = 1.2\) would indicate that \({x_1}\) is 1.2 standard deviations greater than the sample mean. Similarly, \(z_2 = -0.5\) would indicate that \({x_2}\) is 0.5 standard deviation less than the sample mean.
    • A z-score greater than zero occurs for observations with a value greater than the mean, and a z-score less than zero occurs for observations with a value less than the mean.
    • A z-score of zero indicates that the value of the observation is equal to the mean.
    • The z-score for any observation can be interpreted as a measure of the relative location of the observation in a data set.
    • The process of converting a value for a variable to a z-score is often referred to as a z transformation or scaling.

\[z_i = \frac{{x}_i - {\overline{x}}}{{s}} \tag{18.14}\]

NOTE: The “Z statistic” is a special case of the “critical Z” because \(\sigma/\sqrt{n}\) is the ‘standard error of the sample mean,’ which is itself a standard deviation: per the CLT, it is the standard deviation of the sample mean, rather than (e.g.) a known population standard deviation or even just the sample standard deviation. The ‘critical Z’ (i.e. standard score) is something that can always be computed (“the general case”) whenever there is a mean and standard deviation; it translates X into a Z variable with zero mean and unit variance (it “imposes normality” even when the data may not be normal!). The “Z statistic” similarly standardizes, as a special case where it standardizes the sample mean.

Definition 18.21 Computing a z-score requires knowing the mean \({\mu}\) and standard deviation \({\sigma}\) of the complete population to which a data point belongs. If one only has a sample of observations from the population, then the analogous computation with sample mean \({\overline{x}}\) and sample standard deviation \({s}\) yields the t-statistic.

Caution:

  • Scaling does influence the interpretation of the parameters in many statistical analyses (regression, PCA etc.), so the decision to scale should be based on how you want to interpret your parameters.
    • Although the shapes of distributions are unchanged by scaling, the distributions themselves are definitely changed. Ex: After scaling, a Poisson distribution would no longer be a Poisson distribution.
    • However, scaling will not change the shape of the underlying distribution of the variable, nor will it influence (positively or negatively) the violations of model assumptions.

xxflights <- f_getRDS(xxflights)
bb <- na.omit(xxflights$air_time)
# #Scaling
ii <- {bb - mean(bb)} / sd(bb)
str(ii)
##  num [1:327346] 0.8145 0.8145 0.0994 0.3449 -0.3702 ...
##  - attr(*, "na.action")= 'omit' int [1:9430] 472 478 616 644 726 734 755 839 840 841 ...
# #scale() gives a Matrix with original mean and sd as its attribute
jj <- scale(bb)
str(jj)
##  num [1:327346, 1] 0.8145 0.8145 0.0994 0.3449 -0.3702 ...
##  - attr(*, "scaled:center")= num 151
##  - attr(*, "scaled:scale")= num 93.7
stopifnot(identical(as.vector(ii), as.vector(jj)))
#
hh <- tibble(ee = as.vector(jj))
ttl_hh <- "Flights: Air Time (Scaled)"
cap_hh <- "C03P08" #iiii

Image


Figure 18.2 Before and After Scaling

Histogram

# #hh$ee ttl_hh cap_hh
#
C03 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
  annotate(geom = "text", x = mean(.[[1]]), y = -Inf, 
           label = TeX(r'($\bar{x}$)', output = "character"), 
           color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE) +
  coord_cartesian(ylim = c(0, 35000)) +
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Frequency", 
       subtitle = paste0("(Mean= ", round(mean(.[[1]]), 3), 
                         "; SD= ", round(sd(.[[1]]), 3),
                         ")"), 
      caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C03)
rm(C03)

Annotate

if(FALSE){
# #check_overlap = TRUE works for de-blurring. However, it still checks each point thus slow
geom_text(aes(label = TeX(r'($\bar{x}$)', output = "character"), 
              x = mean(.data[["ee"]]), y = -Inf),
          color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE, check_overlap = TRUE) 
# #Create your own dataset
geom_text(data = tibble(x = mean(.[[1]]), y = -Inf, 
                        label = TeX(r'($\bar{x}$)', output = "character")), 
          aes(x = x, y = y, label = label), 
          color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE ) 
# #Or Equivalent
ggplot2::annotate(geom = "text", x = mean(.[[1]]), y = -Inf, 
                  label = TeX(r'($\bar{x}$)', output = "character"), 
                  color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE) 
#
ggpp::annotate(geom = "text", x = mean(.[[1]]), y = -Inf, 
               label = TeX(r'($\bar{x}$)', output = "character"), 
               color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE) 
}

Colours

# #List All Colour Names in R
str(colors())
##  chr [1:657] "white" "aliceblue" "antiquewhite" "antiquewhite1" "antiquewhite2" "antiquewhite3" ...
# #Packages: viridis, scales, viridisLite
# #Show N Colours with Max. Contrast
q_colors <- 5
# #Display Colours
if(FALSE) show_col(viridis_pal()(q_colors))
# #Get the Viridis i.e. "D" palette Hex Values for N Colours
v_colors <-  viridis(q_colors, option = "D")
v_colors
## [1] "#440154FF" "#3B528BFF" "#21908CFF" "#5DC863FF" "#FDE725FF"
#
# #Diverging Colour Palette from 'RColorBrewer'
# #Hex Values
brewer.pal(3, "BrBG")
## [1] "#D8B365" "#F5F5F5" "#5AB4AC"
if(FALSE) display.brewer.pal(3, "BrBG")

18.8.2 Chebyshev Theorem

Definition 18.22 Chebyshev Theorem can be used to make statements about the proportion of data values that must be within a specified number of standard deviations \({\sigma}\), of the mean \({\mu}\).
  • Refer to 18.22
    • Chebyshev Theorem: At least \((1-1/z^2)\) of the data values must be within z standard deviations of the mean, where z is any value greater than 1.
      • Thus, at least 75% of the data values must be within \(\overline{x} \pm 2s\), at least 89% within \(\overline{x} \pm 3s\), and at least 94% within \(\overline{x} \pm 4s\).
    • Chebyshev theorem can be applied to any data set regardless of the shape of the distribution of the data.
    • Ex: Test scores of 100 students have \((\mu = 70, \sigma = 5)\)
    • How many students had test scores between 60 and 80
      • From equation (18.14), \(z_{60} = \frac{60 - 70}{5} = -2\)
      • Similarly, \(z_{80} = \frac{80 - 70}{5} = +2\)
      • According to Theorem 18.22, the proportion of values that must lie within \({z}\) standard deviations of the mean is
        • \({(1-1/z^2) = (1 - 1/2^2) = 0.75 = 75\%}\)
        • i.e. at least 75 students must have test scores between 60 and 80
    • How many students had test scores between 58 and 82
      • \(z_{58} = -2.4, z_{82} = +2.4\)
      • \({(1 - 1/2.4^2) \approx 0.826 \approx 83\%}\)
        • i.e. at least 83 students must have test scores between 58 and 82
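The bound in the worked example can be computed directly; a minimal base-R sketch (the helper name `f_chebyshev` is illustrative, not from the text):

```r
# #Chebyshev bound: minimum proportion within z standard deviations (z > 1)
f_chebyshev <- function(z) 1 - 1/z^2
f_chebyshev(2)     # 0.75 -> at least 75 of 100 students between 60 and 80
f_chebyshev(2.4)   # ~0.826 -> at least ~83 students between 58 and 82
```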

18.8.3 Empirical Rule

Definition 18.23 Empirical rule is used to compute the percentage of data values that must be within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. These percentages are approximately 68.27%, 95.45%, and 99.73%, respectively.
  • According to the empirical rule, for a Normal distribution
    • \(Pr({\mu} - 1{\sigma} \leq {X} \leq {\mu} + 1{\sigma}) \approx 68.27\%\)
    • \(Pr({\mu} - 2{\sigma} \leq {X} \leq {\mu} + 2{\sigma}) \approx 95.45\%\) i.e. most data values
    • \(Pr({\mu} - 3{\sigma} \leq {X} \leq {\mu} + 3{\sigma}) \approx 99.73\%\) i.e. almost all data values
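These percentages follow directly from the standard normal CDF, e.g. via `pnorm()`:

```r
# #Empirical rule percentages from the standard normal CDF
# #Pr(mu - k*sigma <= X <= mu + k*sigma) for k = 1, 2, 3
round(sapply(1:3, function(k) pnorm(k) - pnorm(-k)), 4)
## [1] 0.6827 0.9545 0.9973
```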

18.9 Outliers

Definition 18.24 Unusually large or unusually small values are called outliers. An outlier is a data point that differs significantly from the other observations.
  • Reasons
    • Outliers can occur by chance in any distribution, but they often indicate either measurement error or that the population has a heavy-tailed distribution. A frequent cause of outliers is a mixture of two distributions, which may be two distinct sub-populations.
    • In most larger samplings of data, some data points will be further away from the sample mean than what is deemed reasonable. However, a small number of outliers is to be expected (and not due to any anomalous condition).
    • Estimators capable of coping with outliers are said to be robust: the median is a robust statistic of central tendency, while the mean is not. However, the mean is generally a more precise estimator.
  • Keeping vs. Removing Outliers
    • A data value that has been incorrectly recorded or included should be removed
    • An unusual data value that has been recorded correctly and belongs in the data set should be kept
  • Standardized values (z-scores) can be used to identify outliers.
    • Empirical Rule allows us to conclude that for normal distribution, almost all the data values will be within three standard deviations of the mean \((\overline{x} \pm 3s)\).
    • Hence, in using z-scores to identify outliers, we recommend treating any data value with a z-score less than −3 or greater than +3 as an outlier.
    • Such data values can then be reviewed for accuracy and to determine whether they belong in the data set.
    • In the case of normally distributed data, the three sigma rule can be used to identify outliers.
      • In a sample of 1000 observations, the presence of up to five observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected. If the sample size is only 100, however, just three such outliers are already reason for concern.
  • Another approach to identifying outliers is based upon IQR
    • \(\text{Lower Limit} = Q_1 - 1.5 \space \text{IQR}\) and \(\text{Upper Limit} = Q_3 + 1.5 \space \text{IQR}\)
    • An observation is classified as an outlier if its value is less than the lower limit or greater than the upper limit.
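Both rules can be sketched in base R on simulated data (hypothetical values, with one extreme observation injected to illustrate):

```r
# #Flag outliers by z-score (|z| > 3) and by the 1.5 IQR fences
# #Simulated test scores with one injected extreme value
set.seed(42)
x <- c(rnorm(99, mean = 70, sd = 5), 200)
#
# #z-score rule
z <- (x - mean(x)) / sd(x)
x[abs(z) > 3]
#
# #IQR rule
q <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
x[x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr]
```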

18.10 Summary

Five-Number Summary is used to quickly summarise a dataset. i.e. Min, Q1, Median, Q3, Max

  • A boxplot is a graphical display of data based on a five-number summary.
    • By using the interquartile range, IQR = Q3 − Q1, limits are located at 1.5(IQR) below Q1 and 1.5(IQR) above Q3
    • The whiskers are drawn from the ends of the box to the smallest and largest values inside the limits
    • Boxplots can also be used to provide a graphical summary of two or more groups and facilitate visual comparisons among the groups.

BoxPlot

geom_boxplot()

Figure 18.3 geom_boxplot()

Code

# #nycflights13::weather
bb <- weather
# #NA are present in the data
summary(bb$temp)
#
# #BoxPlot
C03P01 <- bb %>% drop_na(temp) %>% mutate(month = factor(month, ordered = TRUE)) %>% {
    ggplot(data = ., mapping = aes(x = month, y = temp)) +
    #geom_violin() +
    geom_boxplot(aes(fill = month), outlier.colour = 'red', notch = TRUE) +
    stat_summary(fun = mean, geom = "point", size = 2, color = "steelblue") + 
    scale_y_continuous(breaks = seq(0, 110, 10), limits = c(0, 110)) +
    #geom_point() +
    #geom_jitter(position=position_jitter(0.2)) +
    k_gglayer_box +
    theme(legend.position = 'none') +
    labs(x = "Months", y = "Temperature", subtitle = "With Mean & Notch", 
         caption = "C03P01", title = "BoxPlot")
}

18.11 Relationship between Two Variables

18.11.1 Covariance

Definition 18.25 Covariance is a measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.
  • Refer equation (18.15)
    • For a sample of size \({n}\) with the observations \((x_1, y_1), (x_2, y_2)\), and so on, the covariance is given by equation (18.15)
    • A positive value for \(s_{xy}\) indicates a positive linear association between x and y; that is, as the value of x increases, the value of y increases. Similarly a negative value shows a negative linear association.
      • In the example, \(s_{xy} = 11\)
    • If the points are evenly distributed in the scatterplot, the value of \(s_{xy}\) will be close to zero, indicating no linear association between x and y.
    • Caution: Problem with using covariance as a measure of the strength of the linear relationship is that the value of the covariance depends on the units of measurement for x and y.

\[\begin{align} \sigma_{xy} &= \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{n} \\ s_{xy} &= \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{n-1} \end{align} \tag{18.15}\]


Figure 18.4 Scatter Plot Quadrants for Covariance

Covariance

# #Get 'Deviation about the mean' i.e. devX and devY and their Product devXY
ii <- bb %>% 
  mutate(devX = Commercials - mean(Commercials), devY = Sales - mean(Sales), devXY = devX * devY) 
#
# #Sample Covariance
sxy <- sum(ii$devXY) / {length(ii$devXY) -1}
print(sxy)
## [1] 11

Code

bb <- f_getRDS(xxCommercials) 

# #Define the formula for Trendline calculation
k_gg_formula <- y ~ x
#
# #Scatterplot, Trendline Equation, R2, mean x & y
C03P02 <- bb %>% {
  ggplot(data = ., aes(x = Commercials, y = Sales)) + 
  geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
  stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")), 
               formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
               eq.x.rhs = "~italic(x)", parse = TRUE) +
  geom_vline(aes(xintercept = round(mean(Commercials), 3)), color = 'red', linetype = "dashed") +
  geom_hline(aes(yintercept = round(mean(Sales), 3)), color = 'red', linetype = "dashed") +
  geom_text(aes(label = TeX(r"($\bar{x} = 3$)", output = "character"), 
                x = round(mean(Commercials), 3), y = -Inf), 
            color = 'red', hjust = -0.2, vjust = -0.5, parse = TRUE, check_overlap = TRUE) + 
  geom_text(aes(label = TeX(r"($\bar{y} = 51$)", output = "character"), 
                x = Inf, y = round(mean(Sales), 3)), 
            color = 'red', hjust = 1.5, vjust = -0.5, parse = TRUE, check_overlap = TRUE) + 
  geom_point() +
  k_gglayer_scatter +
  labs(x = "Commercials", y = "Sales ($100s)",
       subtitle = TeX(r"(Trendline Equation, $R^{2}$, $\bar{x}$ and $\bar{y}$)"), 
       caption = "C03P02", title = "Scatter Plot")
}

More Text

  • This
    • Unlike Pearson correlation, covariance itself is not a measure of the magnitude of linear relationship. It is a measure of co-variation (which could be just monotonic). This is because covariance depends not only on the strength of linear association but also on the magnitude of the variances.
  • More details are in following links

18.11.2 Correlation Coefficient

Definition 18.26 Correlation coefficient is a measure of linear association between two variables that takes on values between −1 and +1. Values near +1 indicate a strong positive linear relationship; values near −1 indicate a strong negative linear relationship; and values near zero indicate the lack of a linear relationship.
  • Refer equation (18.16) & Table 18.1
    • The ‘Pearson Product Moment Correlation Coefficient’ or sample correlation coefficient is computed by dividing the sample covariance \(s_{xy}\) by the product of the sample standard deviation of x (\(s_{x}\)) and the sample standard deviation of y (\(s_{y}\)).
      • Values close to −1 (negative) or +1 (positive) indicate a strong linear relationship. The closer the correlation is to zero, the weaker the relationship.
    • In the example, \(s_{xy} = 11\) (Equation (18.15)) and \(s_{x} = 1.49\), \(s_{y} = 7.93\) (Equation (18.12))
    • Thus, \(r_{xy} = 0.93\)
    • Caution: Correlation provides a measure of linear association and not necessarily causation. A high correlation between two variables does not mean that changes in one variable will cause changes in the other variable.
    • Caution: Because the correlation coefficient measures only the strength of the linear relationship between two quantitative variables, it is possible for the correlation coefficient to be near zero, suggesting no linear relationship, when the relationship between the two variables is nonlinear.

\[\begin{align} \rho_{xy} &= \frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}} \\ r_{xy} &= \frac{s_{xy}}{s_{x}s_{y}} \end{align} \tag{18.16}\]

Correlation

# #Get 'Deviation about the mean' i.e. devX and devY and their Product devXY
ii <- bb %>% 
  mutate(devX = Commercials - mean(Commercials), devY = Sales - mean(Sales), devXY = devX * devY) 
#
# #Sample Covariance
sxy <- sum(ii$devXY) / {length(ii$devXY) -1}
print(sxy)
## [1] 11
jj <- ii %>% mutate(devXsq = devX * devX, devYsq = devY * devY)
# #Sample Covariance Sxy, Sample Standard Deviations Sx, Sy
sxy <- sum(ii$devXY) / {nrow(ii) -1}
sx <- round(sqrt(sum(jj$devXsq) / {nrow(jj) -1}), 2)
sy <- round(sqrt(sum(jj$devYsq) / {nrow(jj) -1}), 2)
cat(paste0("Sxy =", sxy, ", Sx =", sx, ", Sy =", sy, "\n"))
## Sxy =11, Sx =1.49, Sy =7.93
#
# #Correlation Coefficient Rxy
rxy <- round(sxy / {sx * sy}, 2)
cat(paste0("Correlation Coefficient Rxy =", rxy, "\n"))
## Correlation Coefficient Rxy =0.93
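The manual calculation can be cross-checked with base R's `cov()` and `cor()`, using the Commercials/Sales values from Table 18.1:

```r
# #Cross-check the manual calculation with cov() and cor()
# #Data from Table 18.1
Commercials <- c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2)
Sales <- c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46)
cov(Commercials, Sales)            ## [1] 11
round(cor(Commercials, Sales), 2)  ## [1] 0.93
```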

Data

Table 18.1: (C03T01) Correlation Calculation
Week Commercials Sales devX devY devXY devXsq devYsq
1 2 50 -1 -1 1 1 1
2 5 57 2 6 12 4 36
3 1 41 -2 -10 20 4 100
4 3 54 0 3 0 0 9
5 4 54 1 3 3 1 9
6 1 38 -2 -13 26 4 169
7 5 63 2 12 24 4 144
8 3 48 0 -3 0 0 9
9 4 59 1 8 8 1 64
10 2 46 -1 -5 5 1 25

Validation


19 Probability

19.1 Overview

  • This chapter covers Probability, Factorial, Combinations, Permutations, Bayes Theorem.

19.2 Probability

Definition 19.1 Probability is a numerical measure of the likelihood that an event will occur. Probability values are always assigned on a scale from 0 to 1. A probability near zero indicates an event is unlikely to occur; a probability near 1 indicates an event is almost certain to occur.
Definition 19.2 A random experiment is a process that generates well-defined experimental outcomes. On any single repetition or trial, the outcome that occurs is determined completely by chance.
Definition 19.3 The sample space for a random experiment is the set of all experimental outcomes.
  • Random experiment of tossing a coin has a Sample Space \(S = \{\text{Head}, \text{Tail}\}\)
  • Random experiment of rolling a die has a Sample Space \(S = \{1, 2, 3, 4, 5, 6\}\)
  • Random experiment of tossing Two coins has a Sample Space \(S = \{\text{HH}, \text{HT}, \text{TH}, \text{TT}\}\)

19.3 Counting Rule

Definition 19.4 Counting Rule for Multiple-Step Experiments: If an experiment can be described as a sequence of \({k}\) steps with \({n_1}\) possible outcomes on the first step, \({n_2}\) possible outcomes on the second step, and so on, then the total number of experimental outcomes is given by \(\{(n_1)(n_2) \cdots (n_k) \}\)
Definition 19.5 A tree diagram is a graphical representation that helps in visualizing a multiple-step experiment.

19.4 Factorial

Definition 19.6 The factorial of a non-negative integer \({n}\), denoted by \(n!\), is the product of all positive integers less than or equal to n. The value of 0! is 1 i.e. \(0!=1\)

\[\begin{align} n! &= \prod _{i=1}^n i = n \cdot (n-1)! \\ &= n \cdot(n-1)\cdot(n-2)\cdot(n-3)\cdot\cdots \cdot 3 \cdot 2 \cdot 1 \end{align} \tag{19.1}\]

19.5 Combinations

Definition 19.7 Combination allows one to count the number of experimental outcomes when the experiment involves selecting \({k}\) objects from a set of \({N}\) objects. The number of combinations of \({N}\) objects taken \({k}\) at a time is equal to the binomial coefficient \(C_k^N\)

\[C_k^N = \binom{N}{k} = \frac{N!}{k!(N-k)!} \tag{19.2}\]

19.6 Permutations

Definition 19.8 Permutation allows one to compute the number of experimental outcomes when \({k}\) objects are to be selected from a set of \({N}\) objects where the order of selection is important. The same \({k}\) objects selected in a different order are considered a different experimental outcome. The number of permutations of \({N}\) objects taken \({k}\) at a time is given by \(P_k^N\)

\[P_k^N = k! \binom{N}{k} = \frac{N!}{(N-k)!} \tag{19.3}\]

  • The number of permutations of \({k}\) distinct objects is \(k!\)
    • An experiment results in more permutations than combinations for the same number of objects because every selection of \({k}\) objects can be ordered in \(k!\) different ways.
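Base R provides `factorial()` and `choose()` for these formulas; the permutation count follows as \(k!\binom{N}{k}\):

```r
# #Base R counting functions for equations (19.1) to (19.3)
factorial(5)                  ## [1] 120
choose(6, 3)                  # C(6,3) = 20 combinations
choose(6, 3) * factorial(3)   # P(6,3) = 120 permutations
```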

19.7 Assigning Probabilities

  • Basic Requirements (Similar to the Discrete Probability & Continuous Probability)
    1. The probability assigned to each experimental outcome must be between 0 and 1, inclusively. If we let \({E_i}\) denote the \(i^{th}\) experimental outcome and \(P(E_i)\) its probability, then \(P(E_i) \in [0, 1]\)
    2. The sum of the probabilities for all the experimental outcomes must equal 1. Thus for \({k}\) experimental outcomes \(\sum _{i=1}^k P(E_i) =1\)
Definition 19.9 An event is a collection of sample points. The probability of any event is equal to the sum of the probabilities of the sample points in the event. The sample space, \({S}\), is an event. Because it contains all the experimental outcomes, it has a probability of 1; that is, \(P(S) = 1\)
Definition 19.10 Given an event \({A}\), the complement of A (\(A^c\)) is defined to be the event consisting of all sample points that are not in A. Thus, \(P(A) + P(A^{c}) =1\)
Definition 19.11 Given two events A and B, the union of A and B is the event containing all sample points belonging to A or B or both. The union is denoted by \(A \cup B\)
Definition 19.12 Given two events A and B, the intersection of A and B is the event containing the sample points belonging to both A and B. The intersection is denoted by \(A \cap B\)
  • Refer to the Addition Law in the equation (19.4)

\[P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{19.4}\]

Definition 19.13 Two events are said to be mutually exclusive if the events have no sample points in common. Thus, \(A \cap B = \emptyset\), which implies \(P(A \cap B) = 0\)

19.8 Exercises

  • How many ways can three items be selected from a group of six items
    • Solution: \(C_{3}^{6} = 6!/(3!\,3!) = 20\)
  • In an experiment of tossing a coin three times, how many experimental outcomes are possible
    • Solution: \(2^{3} = 8\)
  • Simple random sampling uses a sample of size k from a population of size N to obtain data that can be used to make inferences about the characteristics of a population. Suppose that, from a population of 50 bank accounts, we want to take a random sample of four accounts in order to learn about the population. How many different random samples of four accounts are possible
    • Solution: \(C_{4}^{50} = 50!/(4!\,46!) = 230300\)
  • To play Powerball, a participant must select five numbers from the digits 1 through 59, and then select a Powerball number from the digits 1 through 35. To determine the winning numbers for each game, lottery officials draw 5 white balls out a drum of 59 white balls numbered 1 through 59 and 1 red ball out of a drum of 35 red balls numbered 1 through 35. To win the Powerball jackpot, numbers on the lottery must match the numbers on the 5 white balls in any order and must also match the number on the red Powerball. How many Powerball lottery outcomes are possible
    • Solution: \(C_{5}^{59} \times C_{1}^{35} = 5006386 \times 35 = 175223510\)
  • An experiment has four equally likely outcomes: E1, E2, E3, and E4
    • What is the probability that E2 occurs
      • Solution: \({1/4}\)
    • What is the probability that any two of the outcomes occur (e.g., E1 or E3)
      • Solution: \(2/4 = 1/2\)
    • What is the probability that any three of the outcomes occur (e.g., E1 or E2 or E4)
      • Solution: \({3/4}\)
  • Consider the experiment of selecting a playing card from a deck of 52 playing cards. Each card corresponds to a sample point with a 1/52 probability.
    • Probability of the event that an ace is selected
      • Solution: \(4/52 = 1/13\)
    • Probability of the event that a club is selected
      • Solution: \(13/52 = 1/4\)
    • Probability of the event that a face card (jack, queen, or king) is selected
      • Solution: \(3 \times 4/52 = 12/52 = 3/13\)
  • Consider the experiment of rolling a pair of dice. Suppose that we are interested in the sum of the face values showing on the dice.
    • How many sample points are possible
      • Solution: \(6 \times 6 = 36\)
    • What is the probability of obtaining a value of 7
      • Solution: \(E_{7} = \{(1,6), (6,1), (2,5), (5,2), (3,4), (4,3)\} \Rightarrow P(E_{7}) = 6/36 = 1/6\)
    • What is the probability of obtaining a value of 9 or greater
      • Solution: \(P(E_{\geq9}) = P(E_{9}, E_{10}, E_{11}, E_{12}) = \frac{4 + 3 + 2 + 1}{36} = \frac{5}{18}\)
    • Because each roll has six possible even values (2, 4, 6, 8, 10, and 12) and only five possible odd values (3, 5, 7, 9, and 11), the dice should show even values more often than odd values. Do you agree with this statement
      • Solution: No. Of the 36 outcomes, 18 have an odd sum and 18 have an even sum, so \(P(E_{\text{odd}}) = P(E_{\text{even}}) = 18/36 = 1/2\)
  • A survey of magazine subscribers showed that 45.8% rented a car during the past 12 months for business reasons, 54% rented a car during the past 12 months for personal reasons, and 30% rented a car during the past 12 months for both business and personal reasons.
    • Let B denote Business, P denote Personal
    • What is the probability that a subscriber rented a car during the past 12 months for business or personal reasons
      • Solution: \(P(B \cup P) = P(B) + P(P) - P(B \cap P) = 0.458 + 0.540 - 0.3 = 0.698\)
    • What is the probability that a subscriber did not rent a car during the past 12 months for either business or personal reasons
      • Solution: \(P(B \cup P)^{c} = 1 - 0.698 = 0.302\)
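The dice answers above can be verified by enumerating the full sample space; a minimal base-R sketch:

```r
# #Enumerate all outcomes of rolling a pair of dice
d <- expand.grid(die1 = 1:6, die2 = 1:6)
s <- d$die1 + d$die2
nrow(d)             # 36 sample points
mean(s == 7)        # 6/36 = 1/6
mean(s >= 9)        # 10/36 = 5/18
mean(s %% 2 == 1)   # odd sums: 18/36 = 1/2
```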

19.9 Conditional Probability

Definition 19.14 Conditional probability is the probability of an event given that another event already occurred. The conditional probability of ‘A given B’ is \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)
Table 19.1: (C04T01) Police: Promotion and Gender
Promo_Gender Men Women SUM
Promoted 288 36 324
NotPromoted 672 204 876
Total 960 240 1200
Table 19.1: (C04T01A) Joint and Marginal Probabilities
Promo_Gender Men Women SUM
Promoted 0.24 0.03 0.27
NotPromoted 0.56 0.17 0.73
Total 0.80 0.20 1.00
  • Refer to the Police Promotion Table 19.1
    • Let, M (Man), W (Woman), A (Promoted), \(A^{c}\) (Not Promoted)
    • Probability that a randomly selected officer …
      • is man and is promoted: \(P(A \cap M) = 288/1200 = 0.24\)
      • is woman and is promoted: \(P(A \cap W) = 36/1200 = 0.03\)
      • is man and is not promoted: \(P(A^{c} \cap M) = 672/1200 = 0.56\)
      • is woman and is not promoted: \(P(A^{c} \cap W) = 204/1200 = 0.17\)
      • NOTE: Each of these are Joint Probabilities because these provide intersection of two events.
    • Marginal probabilities are the values in the margins of the joint probability table and indicate the probabilities of each event separately.
      • \(P(M) = 0.80, P(W) = 0.20, P(A) = 0.27, P(A^{c}) = 0.73\)
      • Ex: the marginal probability of being promoted is \(P(A) = P(A \cap M) + P(A \cap W)\)
    • Conditional Probability Analysis
      • “the probability that an officer is promoted given that the officer is a man” \(P(A|M)\)
        • \(P(A|M) = 288/960 = 0.30\)
        • OR \(P(A|M) = P(A \cap M) / P(M) = 0.24/0.80 = 0.30\)
        • “Given that an officer is a man, that officer had a 30% chance of receiving a promotion”
      • “the probability that an officer is promoted given that the officer is a woman” \(P(A|W)\)
        • \(P(A|W) = P(A \cap W) / P(W) = 0.03/0.20 = 0.15\)
        • “Given that an officer is a woman, that officer had a 15% chance of receiving a promotion”
      • Conclusion
        • The probability of a promotion given that the officer is a man is .30, twice the .15 probability of a promotion given that the officer is a woman.
        • Although the use of conditional probability does not in itself prove that discrimination exists in this case, the conditional probability values do support this argument.
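The conditional probabilities above can be reproduced from the counts in Table 19.1; a minimal base-R sketch:

```r
# #Conditional promotion probabilities from the counts in Table 19.1
# #P(A|M) = P(A and M) / P(M), computed directly from the table
promoted <- c(Men = 288, Women = 36)
total    <- c(Men = 960, Women = 240)
promoted / total   # P(A|M) = 0.30, P(A|W) = 0.15
```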
Definition 19.15 Two events A and B are independent if \(P(A|B) = P(A) \quad \text{OR} \quad P(B|A) = P(B) \Rightarrow P(A \cap B) = P(A) \cdot P(B)\)
  • Refer to the Multiplication Law in the equation (19.5)
    • Example: 84% of the households in a neighborhood subscribe to the daily edition of a newspaper; that is, \(P(D) =0.84\). In addition, it is known that the probability that a household that already holds a daily subscription also subscribes to the Sunday edition is .75; that is, \(P(S|D) =0.75\)
      • What is the probability that a household subscribes to both the Sunday and daily editions of the newspaper
        • \(P(S \cap D) = P(D) \cdot P(S|D) = 0.84 \times 0.75 = 0.63\)
        • “63% of the households subscribe to both the Sunday and daily editions”

\[\begin{align} P(A \cap B) &= P(B) \cdot P(A | B) \\ &= P(A) \cdot P(B | A) \end{align} \tag{19.5}\]

  • Mutually Exclusive vs. Independent Events
    • Two events with nonzero probabilities cannot be both mutually exclusive and independent.
    • If one mutually exclusive event is known to occur, the other cannot occur; thus, the probability of the other event occurring is reduced to zero. They are therefore dependent.

19.10 Bayes Theorem

Often, we begin the analysis with initial or prior probability estimates for specific events of interest. Then, from sources such as a sample, a special report, or a product test, we obtain additional information about the events. Given this new information, we update the prior probability values by calculating revised probabilities, referred to as posterior probabilities. Bayes theorem provides a means for making these probability calculations.

  • Refer to the equation (19.6)
    • Bayes theorem is applicable when the events for which we want to compute posterior probabilities are mutually exclusive and their union is the entire sample space.
      • An event, \(A\), and its complement, \(A^{c}\), are mutually exclusive, and their union is the entire sample space. Thus, Bayes theorem is always applicable for computing posterior probabilities of an event and its complement.
    • Example: A firm has two suppliers, currently 65% parts are supplied by one and remaining by other; that is, \(P(A_{1}) = 0.65, P(A_{2}) = 0.35\). Quality of products supplied is 98% Good for supplier one and 95% Good for supplier 2.
      • \(P(G|A_{1}) = 0.98, P(B|A_{1}) = 0.02\)
      • \(P(G|A_{2}) = 0.95, P(B|A_{2}) = 0.05\)
      • Given that we received a Bad Part, what is the probability that it came from supplier 2
        • \(P(A_{2}|B) = \frac{P(A_{2})P(B|A_{2})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} = \frac{0.35 \times 0.05}{0.65 \times 0.02 + 0.35 \times 0.05} = 0.5738 \approx 57\%\)
        • Similarly, \(P(A_{1}|B) = 0.4262 \approx 43\%\)
      • NOTE: While the Probability of a random part being from supplier 1 is \(P(A_{1}) = 0.65\), it is reduced to \(P(A_{1}|B) = 0.4262 \approx 43\%\) as we have received new information that the part is Bad.

\[\begin{align} P(A_{1}|B) &= \frac{P(A_{1})P(B|A_{1})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} \\ P(A_{2}|B) &= \frac{P(A_{2})P(B|A_{2})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} \end{align} \tag{19.6}\]
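The two-supplier example can be computed directly from equation (19.6); a minimal base-R sketch:

```r
# #Posterior probabilities for the two-supplier example, equation (19.6)
prior <- c(A1 = 0.65, A2 = 0.35)   # P(A1), P(A2)
p_bad <- c(A1 = 0.02, A2 = 0.05)   # P(B|A1), P(B|A2)
posterior <- prior * p_bad / sum(prior * p_bad)
round(posterior, 4)   # A1 = 0.4262, A2 = 0.5738
```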

Validation


20 Discrete Probability Distributions

20.1 Overview

20.2 Definitions (Ref)

16.13 Quantitative data that measure ‘how many’ are discrete.

16.14 Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.

20.3 Random Variable

Definition 20.1 A random variable is a numerical description of the outcome of an experiment. Random variables must assume numerical values. It can be either ‘discrete’ or ‘continuous.’
Definition 20.2 A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. Coded categorical (factor) variables, e.g. Male as 0 and Female as 1, are also treated as discrete.
Definition 20.3 A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.

20.4 Discrete Probability Distributions

Definition 20.4 The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.
Definition 20.5 For a discrete random variable x, a probability function \(f(x)\), provides the probability for each value of the random variable.
  • The use of the relative frequency method to develop discrete probability distributions leads to what is called an empirical discrete distribution.
    • We treat the data as if they were the population and use the relative frequency method to assign probabilities to the experimental outcomes.
    • The distribution of data is how often each observation occurs, and can be described by its central tendency and variation around that central tendency.
  • Basic Requirements (Similar to the Probability Basics & Continuous Probability)
    1. \(f(x) \geq 0\)
    2. \(\sum {f(x)} = 1\)
  • The simplest example of a discrete probability distribution given by a formula is the discrete uniform probability distribution; \(f(x) = 1/n\), where \({n}\) is the number of values the random variable may assume
    • Each possible value of the random variable has the same probability

20.4.1 Expected Value

Definition 20.6 The expected value, or mean, of a random variable is a measure of the central location for the random variable. i.e. \(E(x) = \mu = \sum xf(x)\)
  • NOTE
    • The expected value is a weighted average of the values of the random variable where the weights are the probabilities.
    • The expected value does not have to be a value the random variable can assume, i.e. the average need not be an integer

20.4.2 Variance

Definition 20.7 The variance is a weighted average of the squared deviations of a random variable from its mean. The weights are the probabilities. i.e. \(\text{Var}(x) = \sigma^2 = \sum \{(x- \mu)^2 \cdot f(x)\}\)
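As a compact check of Definitions 20.6 and 20.7, using the DiCarlo sales probabilities worked in full in Table 20.1 below:

```r
# #E(x) and Var(x) for the DiCarlo sales distribution (cf. Table 20.1)
x  <- 0:5
fx <- c(0.18, 0.39, 0.24, 0.14, 0.04, 0.01)
Ex   <- sum(x * fx)             # E(x) = mu = 1.50
Varx <- sum((x - Ex)^2 * fx)    # Var(x) = sigma^2 = 1.25
```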

20.5 Bivariate Distributions

Definition 20.8 A probability distribution involving two random variables is called a bivariate probability distribution. A discrete bivariate probability distribution provides a probability for each pair of values that may occur for the two random variables.
  • NOTE:
    • Each outcome for a bivariate experiment consists of two values, one for each random variable. Example: Rolling a pair of dice
    • Bivariate probabilities are often called joint probabilities

20.6 Ex Dicarlo

Table

Table 20.1: (C05T04) Variance Calculation
\({x}\) \(f(x)\) \(\sum xf(x)\) \((x - \mu)\) \((x - \mu)^2\) \(\sum {(x - \mu)^{2}f(x)}\)
0 0.18 0 -1.5 2.25 0.405
1 0.39 0.39 -0.5 0.25 0.0975
2 0.24 0.48 0.5 0.25 0.06
3 0.14 0.42 1.5 2.25 0.315
4 0.04 0.16 2.5 6.25 0.25
5 0.01 0.05 3.5 12.25 0.1225
Total 1.00 mu = 1.5 NA NA sigma^2 = 1.25

Data

# #Dicarlo: Days with Number of Cars Sold per day for last 300 days
xxdicarlo <- tibble(Cars = 0:5, Days = c(54, 117, 72, 42, 12, 3))
#
bb <- xxdicarlo
bb <- bb %>% rename(x = Cars, Fx = Days) %>% mutate(across(Fx, ~./sum(Fx))) %>% 
  mutate(xFx = x * Fx, x_mu = x - sum(xFx), 
             x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx) 
R_dicarlo_var_y_C05 <- sum(bb$x_mu_sq_Fx)
# #Total Row
bb <- bb %>% 
  mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) %>% 
  mutate(xFx = ifelse(x == "Total", paste0("mu = ", xFx), xFx),
         x_mu_sq_Fx = ifelse(x == "Total", paste0("sigma^2 = ", x_mu_sq_Fx), x_mu_sq_Fx)) %>% 
  mutate(across(4:5, ~ replace(., x == "Total", NA)))

Change Class

# #Change Column Classes as required
bb %>% mutate(across(1, as.character))
bb %>% mutate(across(everything(), as.character))

Modify Value

bb <- xxdicarlo
ii <- bb %>% rename(x = Cars, Fx = Days) %>% mutate(across(Fx, ~./sum(Fx))) %>% 
  mutate(xFx = x * Fx, x_mu = x - sum(xFx), 
             x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx) 
# #Add Total Row
ii <- ii %>% 
  mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) 
#
# #Modify Specific Row Values without using filter() 
# #filter() does not have an 'un-filter()' counterpart like the group_by()-ungroup() combination
# #Selecting Row where x = "Total" and changing Column Values for Two Columns
ii <- ii %>% 
  mutate(xFx = ifelse(x == "Total", paste0("mu = ", xFx), xFx),
       x_mu_sq_Fx = ifelse(x == "Total", paste0("sigma^2 = ", x_mu_sq_Fx), x_mu_sq_Fx)) 
#
# #Selecting Row where x = "Total" and doing same replacement on Two Columns
ii %>% mutate(across(4:5, function(y) replace(y, x == "Total", NA)))
ii %>% mutate(across(4:5, ~ replace(., x == "Total", NA)))

20.7 Ex Dicarlo GS

Table

Table 20.2: (C05T01) Bivariate Table
Geneva_Saratoga y0 y1 y2 y3 y4 y5 SUM
x0 21 30 24 9 2 0 86
x1 21 36 33 18 2 1 111
x2 9 42 9 12 3 2 77
x3 3 9 6 3 5 0 26
Total 54 117 72 42 12 3 300
Table 20.2: (C05T02) Probability Distribution
Geneva_Saratoga y0 y1 y2 y3 y4 y5 SUM
x0 0.07 0.10 0.08 0.03 0.007 0.000 0.29
x1 0.07 0.12 0.11 0.06 0.007 0.003 0.37
x2 0.03 0.14 0.03 0.04 0.010 0.007 0.26
x3 0.01 0.03 0.02 0.01 0.017 0.000 0.09
Total 0.18 0.39 0.24 0.14 0.040 0.010 1.00

DataGS

xxdicarlo_gs <- tibble(Geneva_Saratoga = c("x0", "x1", "x2", "x3"), 
             y0 = c(21, 21, 9, 3), y1 = c(30, 36, 42, 9), y2 = c(24, 33, 9, 6), 
             y3 = c(9, 18, 12, 3), y4 = c(2, 2, 3, 5), y5 = c(0, 1, 2, 0))
bb <- xxdicarlo_gs
#
# #Tibble Total SUM 
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
#
# #Add Total Row and SUM Column
ii <- bb %>% 
  mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.numeric), sum))) %>% 
  mutate(SUM = rowSums(across(where(is.numeric))))
#
# #Convert to Bivariate Probability Distribution and then add Total Row and SUM Column
jj <- bb %>% 
  mutate(across(where(is.numeric), ~./sum_bb)) %>% 
  mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.numeric), sum))) %>% 
  mutate(SUM = rowSums(across(where(is.numeric)))) %>% 
  mutate(across(where(is.numeric), format, digits = 1))

Tibble Total SUM

bb <- xxdicarlo_gs
# #Assuming there is NO Total Column NOR Total Row and First Column is character
kk <- bb %>% summarise(across(where(is.numeric), sum)) %>% summarise(sum(.)) %>% pull(.)
ll <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
stopifnot(identical(kk, ll))
print(kk)
## [1] 300

format()

bb <- xxdicarlo_gs
# #Round values to 1 significant digit, e.g. 0.003 or 0.02
# #NOTE: This changes the column to "character"
bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% 
  mutate(across(where(is.numeric), format, digits = 1))
## # A tibble: 4 x 7
##   Geneva_Saratoga y0    y1    y2    y3    y4    y5   
##   <chr>           <chr> <chr> <chr> <chr> <chr> <chr>
## 1 x0              0.07  0.10  0.08  0.03  0.007 0.000
## 2 x1              0.07  0.12  0.11  0.06  0.007 0.003
## 3 x2              0.03  0.14  0.03  0.04  0.010 0.007
## 4 x3              0.01  0.03  0.02  0.01  0.017 0.000

20.8 Bivariate …

  • Suppose we would like to know the probability distribution for total sales at both DiCarlo dealerships and the expected value and variance of total sales.
    • We can define \(s = x + y\) as Total Sales.
    • Refer to Tables 20.2 and 20.3
      • \(f(s_0) = f(x_0, y_0) = 0.07\)
      • \(f(s_1) = f(x_0, y_1) + f(x_1, y_0) = 0.10 + 0.07 = 0.17\)

Table

Table 20.3: (C05T03) Bivariate Expected Value and Variance
\(ID\) \({s}\) \(f(s)\) \(\sum sf(s)\) \((s - E(s))\) \((s - E(s))^2\) \(\sum {(s - E(s))^{2}f(s)}\)
A 0 0.070 0.00 -2.64 6.99 0.489
B 1 0.170 0.17 -1.64 2.70 0.459
C 2 0.230 0.46 -0.64 0.41 0.095
D 3 0.290 0.87 0.36 0.13 0.037
E 4 0.127 0.51 1.36 1.84 0.233
F 5 0.067 0.33 2.36 5.55 0.370
G 6 0.023 0.14 3.36 11.27 0.263
H 7 0.023 0.16 4.36 18.98 0.443
I 8 0.000 0.00 5.36 28.69 0.000
Total NA 1.000 E(s) = 2.64 NA NA Var(s) = 2.389

Code

bb <- xxdicarlo_gs
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
# #Convert to Bivariate Probability Distribution
ii <- bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% select(-1)
# #Using tapply(), sum the Matrix
jj <- tapply(X= as.matrix(ii), INDEX = LETTERS[row(ii) + col(ii)-1], FUN = sum)
# #Create Tibble
kk <- tibble(Fs = jj, ID = LETTERS[1:length(Fs)], s = 1:length(Fs) - 1) %>% 
  relocate(Fs, .after = last_col()) %>% 
  mutate(sFs = s * Fs, s_Es = s - sum(sFs), 
             s_Es_sq = s_Es * s_Es, s_Es_sq_Fs = s_Es_sq * Fs) 
# #Save for Notebook
R_dicarlo_var_s_C05 <- sum(kk$s_Es_sq_Fs)
# #For Printing
ll <- kk %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) %>% 
  mutate(across(where(is.numeric), format, digits =2)) %>% 
  mutate(sFs = ifelse(ID == "Total", paste0("E(s) = ", sFs), sFs),
         s_Es_sq_Fs = ifelse(ID == "Total", paste0("Var(s) = ", s_Es_sq_Fs), s_Es_sq_Fs)) %>% 
  mutate(across(c(2, 5, 6), ~ replace(., ID == "Total", NA)))

Bivariate to Original

bb <- xxdicarlo_gs
# #From the Bivariate get the original data
ii <- bb %>% 
  mutate(Fx = rowSums(across(where(is.numeric)))) %>% 
  select(1, 8) %>% 
  separate(col = Geneva_Saratoga, into = c(NA, "x"), sep = 1) %>% 
  mutate(across(1, as.integer))
# #Variance Calculation
jj <- ii %>% mutate(across(Fx, ~./sum(Fx))) %>% 
  mutate(xFx = x * Fx, x_mu = x - sum(xFx), 
             x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx) 
# #Save for Notebook
R_dicarlo_var_x_C05 <- sum(jj$x_mu_sq_Fx)
print(jj)
## # A tibble: 4 x 6
##       x     Fx   xFx   x_mu x_mu_sq x_mu_sq_Fx
##   <int>  <dbl> <dbl>  <dbl>   <dbl>      <dbl>
## 1     0 0.287  0     -1.14   1.31      0.375  
## 2     1 0.37   0.37  -0.143  0.0205    0.00760
## 3     2 0.257  0.513  0.857  0.734     0.188  
## 4     3 0.0867 0.26   1.86   3.45      0.299

Sum Diagonals

bb <- xxdicarlo_gs
#
# #Tibble Total SUM 
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
#
# #Convert to Bivariate Probability Distribution and Exclude First Character Column
ii <- bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% select(-1)
#
# #The 24 (4x6) Experimental Outcomes collapse into 9 Unique Diagonal Groups (1A, 2B, 3C, 4D, 4E, 4F, 3G, 2H, 1I) 
matrix(data = LETTERS[row(ii) + col(ii)-1], nrow = 4)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "A"  "B"  "C"  "D"  "E"  "F" 
## [2,] "B"  "C"  "D"  "E"  "F"  "G" 
## [3,] "C"  "D"  "E"  "F"  "G"  "H" 
## [4,] "D"  "E"  "F"  "G"  "H"  "I"
# 
# #Using tapply(), sum the Matrix
jj <- tapply(X= as.matrix(ii), INDEX = LETTERS[row(ii) + col(ii)-1], FUN = sum)
print(jj)
##          A          B          C          D          E          F          G          H          I 
## 0.07000000 0.17000000 0.23000000 0.29000000 0.12666667 0.06666667 0.02333333 0.02333333 0.00000000
# #In place of LETTERS, a Numerical Index can also be used, but Letters are clearer for grouping
#tapply(X= as.matrix(ii), INDEX = c(0:8)[row(ii) + col(ii)-1], FUN = sum)
#
# #Create Tibble
kk <- tibble(Fs = jj, ID = LETTERS[1:length(Fs)], s = 1:length(Fs) - 1) %>% 
  relocate(Fs, .after = last_col())
print(kk)
## # A tibble: 9 x 3
##   ID        s     Fs
##   <chr> <dbl>  <dbl>
## 1 A         0 0.07  
## 2 B         1 0.17  
## 3 C         2 0.23  
## 4 D         3 0.29  
## 5 E         4 0.127 
## 6 F         5 0.0667
## 7 G         6 0.0233
## 8 H         7 0.0233
## 9 I         8 0

String Split

bb <- xxdicarlo_gs
# #Separate String based on Position 
bb %>% separate(col = Geneva_Saratoga, into = c("A", "B"), sep = 1) 
## # A tibble: 4 x 8
##   A     B        y0    y1    y2    y3    y4    y5
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x     0        21    30    24     9     2     0
## 2 x     1        21    36    33    18     2     1
## 3 x     2         9    42     9    12     3     2
## 4 x     3         3     9     6     3     5     0

20.9 Covariance

  • Covariance of random variables x and y is given by \(\sigma_{xy}\), Refer equation (20.1)
    • NOTE: It does not look like (18.15), but for now I am assuming the two are equivalent
    • Calculated: \(\text{Var}(s) = \text{Var}(x + y) =\) 2.389; \(\text{Var}(y) =\) 1.25; \(\text{Var}(x) =\) 0.869
    • Covariance \(\sigma_{xy} = \frac{2.3895 - 0.8696 - 1.25}{2} = 0.1350\)
    • A covariance of .1350 indicates that daily sales at the two dealerships have a positive relationship.

\[\sigma_{xy} = \frac{\text{Var}(x + y) - \text{Var}(x) - \text{Var}(y)}{2} \tag{20.1}\]

  • Correlation of random variables x and y is given by, Refer equation (18.16), \(\rho_{xy} = \frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}}\)
    • Where \(\sigma_{x} = \sqrt{\text{Var}(x)} = \sqrt{0.8696} = 0.9325\); and \(\sigma_{y} = \sqrt{\text{Var}(y)} = \sqrt{1.25} = 1.1180\)
    • Correlation Coefficient \(\rho_{xy} = \frac{0.1350}{0.9325 \times 1.1180} = 0.1295\)
    • The correlation coefficient of .1295 indicates there is a weak positive relationship between the random variables representing daily sales at the two dealerships.
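
Cov Corr

The covariance and correlation above follow directly from the variance values computed earlier in this chapter; a minimal sketch in base R (the three variance values are taken from the calculations above):

```r
# #Covariance via equation (20.1), using the variances computed earlier
var_s <- 2.3895   #Var(x + y)
var_x <- 0.8696   #Var(x)
var_y <- 1.25     #Var(y)
cov_xy <- (var_s - var_x - var_y) / 2   #sigma_xy
# #Correlation Coefficient rho_xy = sigma_xy / (sigma_x * sigma_y)
rho_xy <- cov_xy / (sqrt(var_x) * sqrt(var_y))
round(c(cov_xy, rho_xy), 4)
```

A difference of about 0.0001 may appear relative to the hand calculation, which rounds \(\sigma_{xy}\) to 0.1350 before dividing.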

20.10 Distributions

  • “ForLater”
    • Binomial Probability Distribution - dbinom(), pbinom(), qbinom(), rbinom()
      • It can be used to determine the probability of obtaining \({x}\) successes in \({n}\) trials.
      • 4 Assumptions must be TRUE
        1. The experiment consists of a sequence of \({n}\) identical trials.
        2. Two outcomes are possible on each trial, one called success and the other failure.
        3. The probability of a success \({p}\) does not change from trial to trial. Consequently, the probability of failure, \(1 − p\), does not change from trial to trial.
        4. The trials are independent.
    • Poisson Probability Distribution - dpois(), ppois(), qpois(), rpois()
      • To determine the probability of obtaining \({x}\) occurrences over an interval of time or space.
      • 2 Assumptions must be TRUE
        1. The probability of an occurrence of the event is the same for any two intervals of equal length.
        2. The occurrence or nonoccurrence of the event in any interval is independent of the occurrence or nonoccurrence of the event in any other interval.
    • Hypergeometric Probability Distribution
      • Like the binomial, it is used to compute the probability of \({x}\) successes in \({n}\) trials.
      • But, in contrast to the binomial, the probability of success changes from trial to trial.
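
As a quick illustration of the d-functions named above, with arbitrary example parameters (not taken from the text):

```r
# #Binomial: probability of exactly 3 successes in 10 trials with p = 0.3
dbinom(x = 3, size = 10, prob = 0.3)
# #Poisson: probability of exactly 2 occurrences when the mean per interval is 4
dpois(x = 2, lambda = 4)
# #Hypergeometric: 2 successes in 4 draws from 5 successes and 15 failures
dhyper(x = 2, m = 5, n = 15, k = 4)
```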

Validation


21 Continuous Probability Distributions

21.1 Overview

21.2 Definitions (Ref)

20.2 A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. This includes factor-type variables, e.g. Male coded as 0 and Female as 1.

20.3 A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.

21.3 Uniform Probability Distribution

Definition 21.1 Uniform probability distribution is a continuous probability distribution for which the probability that the random variable will assume a value in any interval is the same for each interval of equal length. Whenever the probability is proportional to the length of the interval, the random variable is uniformly distributed.
Definition 21.2 The probability that the continuous random variable \({x}\) takes a value between \([a, b]\) is given by the area under the graph of probability density function \(f(x)\); that is, \(A = \int _{a}^{b}f(x)\ dx\). Note that \(f(x)\) can be greater than 1, however its integral must be equal to 1.
  • Basic Requirements (Similar to the Probability Basics & Discrete Probability )
    1. \(f(x) \geq 0\)
    2. \(A = \int _{-\infty}^{\infty}f(x)\ dx = 1\)
  • NOTE:
    • For a discrete random variable, the probability function \(f(x)\) provides the probability that the random variable assumes a particular value. With continuous random variables, the counterpart of the probability function is the probability density function \(f(x)\).
      • The difference is that the probability density function does not directly provide probabilities. However, the area under the graph of \(f(x)\) corresponding to a given interval does provide the probability that the continuous random variable \({x}\) assumes a value in that interval.
      • So when we compute probabilities for continuous random variables we are computing the probability that the random variable assumes any value in an interval (NOT at any particular point).
      • Because the area under the graph of \(f(x)\) at any particular point is zero, the probability of any particular value of the random variable is zero.
      • It also means that the probability of a continuous random variable assuming a value in any interval is the same whether or not the endpoints are included.
    • Expected Value and Variance are given by (21.1)

\[\begin{align} E(x) &= \frac{a + b}{2} \\ \text{Var}(x) &= \frac{(b - a)^2}{12} \end{align} \tag{21.1}\]
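
A minimal sketch of equation (21.1) in base R, assuming an arbitrary interval \([a, b] = [10, 20]\):

```r
# #Uniform probability distribution on [a, b]
a <- 10
b <- 20
punif(q = 15, min = a, max = b)   #P(x <= 15) = 0.5, proportional to interval length
(a + b) / 2                       #E(x) per equation (21.1)
(b - a)^2 / 12                    #Var(x) per equation (21.1)
```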

21.4 Normal Probability Distribution

Definition 21.3 A normal distribution (\({\mathcal{N}}_{({\mu}, \, {\sigma}^2)}\)) is a type of continuous probability distribution for a real-valued random variable.
  • The general form of its probability density function is given by equation (21.2)
    • Normal distribution \({\mathcal{N}}_{({\mu}, \, {\sigma})}\) is also known as Gaussian or Gauss or Laplace–Gauss distribution
    • It is symmetrical
    • The entire family of normal distributions is differentiated by two parameters: the mean \({\mu}\) and the standard deviation \({\sigma}\). They determine the location and shape of the normal distribution.
    • The highest point on the normal curve is at the mean, which is also the median and mode of the distribution.
    • The normal distribution is symmetric around its mean. Its skewness measure is zero.
    • The tails of the normal curve extend to infinity in both directions and theoretically never touch the horizontal axis.
    • Larger values of the standard deviation result in wider, flatter curves, showing more variability in the data.
    • Probabilities for the normal random variable are given by areas under the normal curve. The total area under the curve for the normal distribution is 1.
    • The proportions of values of a normal random variable within \({\mu} \pm {\sigma}\), \({\mu} \pm 2{\sigma}\), and \({\mu} \pm 3{\sigma}\) are \(68.27\%\), \(95.45\%\), and \(99.73\%\) respectively. This is the basis of the Empirical Rule

\[f(x) = {\frac {1}{{\sigma}{\sqrt {2 \pi}}}} e^{-{\frac {1}{2}}\left( {\frac {x-{\mu} }{\sigma}}\right) ^{2}} \tag{21.2}\]
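
Empirical Rule

The Empirical Rule percentages quoted above can be verified with pnorm() on the standard normal:

```r
# #Area under the standard normal curve within mu +/- k*sigma
round(100 * (pnorm(1) - pnorm(-1)), 2)
## [1] 68.27
round(100 * (pnorm(2) - pnorm(-2)), 2)
## [1] 95.45
round(100 * (pnorm(3) - pnorm(-3)), 2)
## [1] 99.73
```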

Normal Distribution

Figure 21.1 Normal Distribution

Histogram

# #Histogram with Density Curve, Mean and Median: Normal Distribution
ee <- f_getRDS(xxNormal)
hh <- tibble(ee)
ee <- NULL
# #Basics
median_hh <- round(median(hh[[1]]), 3)
mean_hh <- round(mean(hh[[1]]), 3)
sd_hh <- round(sd(hh[[1]]), 3)
len_hh <- nrow(hh)
#
# #Base Plot: Creates Only Density Function Line
ii <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + geom_density() }
#
# #Change the line colour and alpha
ii <- ii + geom_density(alpha = 0.2, colour = "#21908CFF") 
#
# #Add Histogram with 50 bins, alpha and fill
ii <- ii + geom_histogram(aes(y = ..density..), bins = 50, alpha = 0.4, fill = '#FDE725FF')
#
# #Full Vertical Line at Mean. Goes across Function Boundary on Y-Axis
#ii <- ii + geom_vline(aes(xintercept = mean_hh), color = '#440154FF')
#
# #Shaded Area Object for Line/Area up to the Function Boundary on Y-Axis
# #Mean
ii_mean <- ggplot_build(ii)$data[[1]] %>% filter(x <= mean_hh)  
# #Median
ii_median <- ggplot_build(ii)$data[[1]] %>% filter(x <= median_hh)
#
# #To show values which are less than Mean in colour
#ii <- ii + geom_area(data = ii_mean, aes(x = x, y = y), fill = 'blue', alpha = 0.5) 
#
# #Line upto the Density Curve at Mean 
ii <- ii + geom_segment(data = ii_mean, 
             aes(x = mean_hh, y = 0, xend = mean_hh, yend = density), color = "#440154FF")
#
# #Label 'Mean' 
ii <- ii + geom_text(aes(label = paste0("Mean= ", mean_hh), x = mean_hh, y = -Inf),
            color = '#440154FF', hjust = -0.5, vjust = -1, angle = 90, check_overlap = TRUE)
#
# #Similarly, Median Line and Label
ii <- ii + geom_segment(data = ii_median, 
             aes(x = median_hh, y = 0, xend = median_hh, yend = density), color = "#3B528BFF") +
  geom_text(aes(label = paste0("Median= ", median_hh), x = median_hh, y = -Inf), 
            color = '#3B528BFF', hjust = -0.4, vjust = 1.2, angle = 90, check_overlap = TRUE) 
#
# #Change Axis Limits
ii <- ii + coord_cartesian(xlim = c(-5, 5), ylim = c(0, 0.5))
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
# # Latex Labels 
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'), 
                TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'), 
                TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
ii <- ii + scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh)
#
# #Get Quantiles and Ranges of mean +/- sigma 
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3l_hh <- density_hh_tbl %>% filter(x <= mean_hh - 3 * sd_hh)
sig3r_hh <- density_hh_tbl %>% filter(x >= mean_hh + 3 * sd_hh)
sig2r_hh <- density_hh_tbl %>% filter(x >= mean_hh + 2 * sd_hh, x < mean_hh + 3 * sd_hh)
sig2l_hh <- density_hh_tbl %>% filter(x <= mean_hh - 2 * sd_hh, x > mean_hh - 3 * sd_hh)
sig1r_hh <- density_hh_tbl %>% filter(x >= mean_hh + sd_hh, x < mean_hh + 2 * sd_hh)
sig1l_hh <- density_hh_tbl %>% filter(x <= mean_hh - sd_hh, x > mean_hh - 2 * sd_hh)
#
# #Use (mean +/- 3 sigma) To Highlight. NOT ALL Zones have been highlighted
ii <- ii + geom_area(data = sig3l_hh, aes(x = x, y = y), fill = 'red') +
           geom_area(data = sig3r_hh, aes(x = x, y = y), fill = 'red')
#
# #Annotate Arrows 
ii <- ii + 
#  ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[3], y = 0.42, 
#                    yend = 0.42, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
#  ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[2], y = 0.45, 
#                    yend = 0.45, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
  ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[1], y = 0.48, 
                    yend = 0.48, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
#  ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[5], y = 0.42, 
#                    yend = 0.42, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
#  ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[6], y = 0.45, 
#                    yend = 0.45, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
  ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[7], y = 0.48, 
                    yend = 0.48, arrow = arrow(type = "closed", length = unit(0.02, "npc")))
#
# #Annotate Labels
ii <- ii + 
#  ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.42, label = "68.3%") +
#  ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.45, label = "95.4%") +
  ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.48, label = "99.7%")
#
# #Add a Theme and adjust Position of Title & Subtitle (Both by plot.title.position) & Caption
# #"plot" or "panel"
ii <- ii + theme(#plot.tag.position = "topleft",
                 #plot.caption.position = "plot", 
                 #plot.caption = element_text(hjust = 0),
                 plot.title.position = "panel")
#
# #Title, Subtitle, Caption, Axis Labels, Tag
ii <- ii + labs(x = "x", y = "Density", 
        subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh, 
                          "; Median= ", median_hh, "; SD= ", sd_hh), 
        caption = "C06AA", tag = NULL,
        title = "Normal Distribution (Symmetrical)")
#
#ii

Plot LaTex

# #Syntax 
#latex2exp::TeX(r'($\sigma =10$)', output = "character")
# #Test Equation
plot(TeX(r'(abc: $\frac{2hc^2}{\lambda^5} \, \frac{1}{e^{\frac{hc}{\lambda k_B T}} - 1}$)'), cex=2)
plot(TeX(r'(xyz: $f(x) =\frac{1}{\sigma \sqrt{2\pi}}\, e^{- \, \frac{1}{2} \,\left(\frac{x - \mu}{\sigma}\right)^2} $)'), cex=2)

Annotate Plot

# #Syntax
ggpp::annotate("text", x = -2, y = 0.3, label=TeX(r'($\sigma =10$)', output = "character"), parse = TRUE, check_overlap = TRUE)
# #NOTE: Complex Equations like the Normal Distribution are crashing R.
ggpp::annotate("text", x = -2, y = 0.3, label=TeX(r'($f(x) =\frac{1}{\sigma \sqrt{2\pi}}\, e^{- \, \frac{1}{2} \, \left(\frac{x - \mu}{\sigma}\right)^2} $)', output = "character"), parse = TRUE, check_overlap = TRUE)

ggplot_build()

# #Data
bb <- f_getRDS(xxNormal)
hh <- tibble(bb)
# #Base Plot
ii <- hh %>% { ggplot(data = ., mapping = aes(x = bb)) + geom_density() }
# #Attributes 
attributes(ggplot_build(ii))$names
## [1] "data"   "layout" "plot"
#
str(ggplot_build(ii)$data[[1]])
## 'data.frame':    512 obs. of  18 variables:
##  $ y          : num  0.000504 0.00052 0.000532 0.000541 0.000545 ...
##  $ x          : num  -3.63 -3.61 -3.6 -3.58 -3.57 ...
##  $ density    : num  0.000504 0.00052 0.000532 0.000541 0.000545 ...
##  $ scaled     : num  0.00126 0.0013 0.00133 0.00136 0.00137 ...
##  $ ndensity   : num  0.00126 0.0013 0.00133 0.00136 0.00137 ...
##  $ count      : num  5.04 5.2 5.32 5.41 5.45 ...
##  $ n          : int  10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
##  $ flipped_aes: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ PANEL      : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ group      : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ ymin       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ymax       : num  0.000504 0.00052 0.000532 0.000541 0.000545 ...
##  $ fill       : logi  NA NA NA NA NA NA ...
##  $ weight     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ colour     : chr  "black" "black" "black" "black" ...
##  $ alpha      : logi  NA NA NA NA NA NA ...
##  $ size       : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ linetype   : num  1 1 1 1 1 1 1 1 1 1 ...

Errors

ERROR 21.1 Error in is.finite(x) : default method not implemented for type 'list'
  • For ggplot(), subsetting inside aes() is discouraged.
  • Assuming names(hh)[1] is “ee”
    • either use (x = “ee”): use of hh[1] or .[1] will throw this error
    • or use (x = .data[[“ee”]]): use of hh[[1]] or .[[1]] will work but will throw a warning.
      • Warning: “Use of .[[1]] is discouraged. Use .data[[1]] instead.”
      • Using .data[[1]] will throw a different error
ERROR 21.2 Error: Must subset the data pronoun with a string.
  • In ggplot() | aes(), using .data[[1]] will throw this error
  • use .data[[“ee”]] or “ee”
    • .data is a pronoun for an environment; it is for scope resolution, not a data frame like the dot (.)

UNICODE

STOP! STOP! Just STOP! using UNICODE for R Console on WINDOWS (UTF-8 Issue).

21.5 Standard Normal

Definition 21.4 A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal{z}}_{({\mu} = 0, \, {\sigma} = 1)}\)

\[f(z) = \varphi(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}} \tag{21.3}\]

  • Refer equation (21.3)
    • Here, the factor \(1/{\sqrt{2\pi}}\) ensures that the total area under the curve \(\varphi(x)\) is equal to one.
    • The factor \(1/2\) in the exponent ensures that the distribution has unit variance, and therefore also unit standard deviation.
    • This function is symmetric around \(x = 0\), where it attains its maximum value \(1/{\sqrt{2\pi}}\) and has inflection points at \(x = +1\) and \(x = -1\).
    • While individual observations from normal distributions are referred to as \({x}\), they are referred to as \({z}\) in the z-distribution.
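
The properties above can be checked numerically; dnorm() is the density \(\varphi\) of equation (21.3):

```r
# #The standard normal density attains its maximum 1/sqrt(2*pi) at z = 0
dnorm(x = 0)
## [1] 0.3989423
1 / sqrt(2 * pi)
## [1] 0.3989423
# #The total area under the density is 1
integrate(dnorm, lower = -Inf, upper = Inf)$value
```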

18.20 The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.

  • NOTE (R, C) notation denotes Row x Column of a Table
    • Because the standard normal random variable is continuous, \(P(z \leq 1.00) = P(z < 1.00)\)
    • The cumulative probability corresponding to \(z = 1.00\) is the table value located at the intersection of the row labeled \({1.0}\) and the column labeled \({.00}\) i.e. \(P_{\left(z\leq 1.00\right)} = P_{\left(1.0, \, .00\right)} = 0.8413\)
    • To compute the probability that \({z}\) is in the interval between −.50 and 1.25
      • \(P_{\left(-0.50 \leq z\leq 1.25\right)} = P_{\left(z\leq 1.25\right)} - P_{\left(z\leq -0.50 \right)} = P_{\left(1.2, \, .05\right)} - P_{\left(-0.50, \, .00\right)} = 0.8944 - 0.3085 = 0.5859\)
    • To compute the probability of obtaining a z value of at least 1.58
      • \(P_{\left(z\geq 1.58\right)} = 1 - P_{\left(z\leq 1.58\right)} = 1 - P_{\left(1.5, \, .08\right)} = 1 - 0.9429 = 0.0571\)
    • To compute the probability that the standard normal random variable is within one standard deviation of the mean
      • \(P_{\left(-1.00 \leq z\leq 1.00\right)} = P_{\left(z\leq 1.00\right)} - P_{\left(z\leq -1.00 \right)} = P_{\left(1.0, \, .00\right)} - P_{\left(-1.0, \, .00\right)} = 0.8413 - 0.1587 = 0.6826\)
  • Reverse i.e. given the probability, find out the z-value
    • Find a z value such that the probability of obtaining a larger z value is .10
      • The standard normal probability table gives the area under the curve to the left of a particular z value, which would be \(P_{\left(z\right)} = 1 - 0.10 = 0.9000 \approx P_{\left(1.2, \, .08\right)} \to z = 1.28\)

\({z} \in \mathbb{R} \iff P_{(z)} \in (0, 1)\)

Cal P

# #Find Cumulative Probability P corresponding to the given 'z' value
# #Area under the curve to the left of z-value = 1.00
pnorm(q = 1.00)
## [1] 0.8413447

pnorm()

# #Find Cumulative Probability P corresponding to the given 'z' value
# #Area under the curve to the left of z-value = 1.00
# #pnorm(q = 1.00) #default 'lower.tail = TRUE'
z_ii <- 1.00 
p_ii <- round(pnorm(q = z_ii, lower.tail = TRUE), 4)
cat(paste0("P(z <= ", format(z_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(z <= 1.000) = 0.8413
#
# #Probability that z is in the interval between −.50 and 1.25 #0.5859
z_min_ii <- -0.50
z_max_ii <- 1.25
p_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ", 
           format(z_max_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(-0.500 <= z <= 1.250) = 0.5858
#
# #Probability of obtaining a z value of at least 1.58 #0.0571
z_ii <- 1.58
p_ii <- round(pnorm(q = z_ii, lower.tail = FALSE), 4)
cat(paste0("P(z >= ", format(z_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(z >= 1.580) = 0.0571
#
# #Probability that the z is within one standard deviation of the mean i.e. [-1, 1] #0.6826
z_min_ii <- -1.00
z_max_ii <- 1.00
p_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ", 
           format(z_max_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(-1.000 <= z <= 1.000) = 0.6827

Cal Z

# #Find a z value such that the probability of obtaining a larger z value is .10
# #z-value for which Area under the curve towards Right is 0.10
qnorm(p = 1 - 0.10)
## [1] 1.281552
qnorm(p = 0.10, lower.tail = FALSE)
## [1] 1.281552

qnorm()

# #Find a z value such that the probability of obtaining a larger z value is .10
# #z-value for which Area under the curve towards Right is 0.10 i.e. right tail = 10%
#qnorm(p = 1 - 0.10)
#qnorm(p = 0.10, lower.tail = FALSE)
p_r_ii <- 0.10 
p_l_ii <- 1 - p_r_ii
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
z_jj <- round(qnorm(p = p_r_ii, lower.tail = FALSE), 4)
stopifnot(identical(z_ii, z_jj))
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ", 
           format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.900 (i.e. (Right) 1-P(z) = 0.100) at z = 1.2816

21.6 Any Normal

  • Any normal distribution can be standardized by converting the individual values into z-scores.
    • z-scores tell how many standard deviations away from the mean each value lies.
  • Probabilities for all normal distributions are computed by using the standard normal distribution.
    • A normal distribution \({\mathcal{N}}_{({\mu}, \, {\sigma})}\) is converted to the standard normal distribution \({\mathcal{z}}_{({\mu} = 0, \, {\sigma} = 1)}\) by equation (21.4) (Similar to equation (18.14))
    • If \({x}\) is a random variable from this population, then its z-score is \(Z = \frac {X - {\mu}}{\sigma}\)
    • If \(\overline{X}\) is the mean of a sample of size \({n}\) from this population, then the standard error is \({\sigma}/{\sqrt{n}}\) and thus the z-score is \(Z = \frac{\overline{X} - {\mu}}{{\sigma}/{\sqrt{n}}}\)
    • If \(\sum {X}\) is the total of a sample of size \({n}\) from this population, then the expected total is \(n\times{\mu}\) and the standard error is \({\sigma}{\sqrt{n}}\). Thus the z-score is \(Z = {\frac{\sum{X}-n{\mu}}{{\sigma}{\sqrt{n}}}}\)
  • Thus
    • \(x = {\mu} \to z = 0\) i.e. A value of \({x}\) equal to its mean \({\mu}\) corresponds to \(z = 0\).
    • \(x = {\mu} + {\sigma} \to z = 1\) i.e. an \({x}\) value that is one standard deviation above its mean \(({\mu} + {\sigma})\) corresponds to \(z = 1\).
      • Thus, we can interpret \({z}\) as the number of standard deviations \(({\sigma})\) that the normal random variable \({x}\) is from its mean \(({\mu})\).
    • For a normal distribution \({\mathcal{N}}_{({\mu} = 10, \, {\sigma} = 2)}\), What is the probability that the random variable x is between 10 and 14
      • At x = 10, z = 0 and at x = 14, z = 2, Thus
      • \(P_{\left(0 \leq z\leq 2\right)} = P_{\left(z\leq 2\right)} - P_{\left(z\leq 0 \right)} = P_{\left(2.0, \, .00\right)} - P_{\left(0, \, .00\right)} = 0.9772 - 0.5000 = 0.4772\)
    • Grear Tire Company Problem
      • For a new tire product, the mileage follows a normal distribution \({\mathcal{N}}_{({\mu} = 36500, \, {\sigma} = 5000)}\).
      • What percentage of the tires can be expected to last more than 40,000 miles, i.e., what is the probability that the tire mileage, x, will exceed 40,000
        • Solution: 24.2%
      • Let us now assume that Grear is considering a guarantee that will provide a discount on replacement tires if the original tires do not provide the guaranteed mileage. What should the guarantee mileage be if Grear wants no more than 10% of the tires to be eligible for the discount guarantee
        • Solution: \(30092 \approx 30100 \text{ miles}\)

\[z = \frac{x - {\mu}}{{\sigma}} \tag{21.4}\]
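
The three z-score variants above differ only in the denominator; a sketch with hypothetical values (\(\mu = 10\), \(\sigma = 2\), \(n = 25\)):

```r
# #z-scores for an individual value, a sample mean, and a sample total (hypothetical values)
mu <- 10
sigma <- 2
n <- 25
(10.8 - mu) / sigma                 #individual x = 10.8 -> z = 0.4
(10.8 - mu) / (sigma / sqrt(n))     #sample mean xbar = 10.8 -> z = 2
(270 - n * mu) / (sigma * sqrt(n))  #sample total sum(X) = 270 -> z = 2
```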

Reasons to convert normal distributions into the standard normal distribution:

  • To find the probability of observations in a distribution falling above or below a given value
  • To find the probability that a sample mean significantly differs from a known population mean
  • To compare scores on different distributions with different means and standard deviations

Each z-score is associated with a probability, or p-value, that gives the likelihood of values below that z-score occurring. By converting an individual value into a z-score, we can find the probability of all values up to that value occurring in a normal distribution.

The z-score is the test statistic used in a z-test. The z-test is used to compare the means of two groups, or to compare the mean of a group to a set value. Its null hypothesis typically assumes no difference between groups.

The area under the curve to the right of a z-score is the p-value, and it’s the likelihood of your observation occurring if the null hypothesis is true.

Usually, a p-value of 0.05 or less means that your results are unlikely to have arisen by chance; it indicates a statistically significant effect.
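
The right-tail area described above maps onto pnorm() with lower.tail = FALSE; the z statistic here is an arbitrary example:

```r
# #One-sided p-value: area under the curve to the right of the z statistic
z_stat <- 1.96
round(pnorm(q = z_stat, lower.tail = FALSE), 4)
## [1] 0.025
```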

Cal P

# #For N(mu =10, sd =2) Probability that X is in [10, 14]
# #Same as P(0 <= z <= 2)
mu_ii <- 10
sd_ii <- 2
x_min_ii <- 10
x_max_ii <- 14
#
z_min_ii <- (x_min_ii - mu_ii) /sd_ii #0
z_max_ii <- (x_max_ii - mu_ii) /sd_ii #2
#
pz_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
# #OR
px_ii <- round(pnorm(q = x_max_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE) - 
                  pnorm(q = x_min_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE), 4)
stopifnot(identical(pz_ii, px_ii))
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ", 
           format(z_max_ii, nsmall = 3), ") = ", pz_ii, "\n"))
## P(0.000 <= z <= 2.000) = 0.4772
cat(paste0("P(", x_min_ii, " <= x <= ", x_max_ii, ") = ", format(px_ii, nsmall = 3), "\n"))
## P(10 <= x <= 14) = 0.4772

Grear Tire

# #Grear Tire N(mu = 36500, sd =5000)
# #Probability that the tire mileage, x, will exceed 40000 # 24.2% Tires
mu_ii <- 36500
sd_ii <- 5000
x_ii <- 40000
#
z_ii <- (x_ii - mu_ii)/sd_ii
#
#pnorm(q = 40000, mean = 36500, sd = 5000, lower.tail = FALSE)
pz_ii <- round(pnorm(q = z_ii, lower.tail = FALSE), 4)
px_ii <- round(pnorm(q = x_ii, mean = mu_ii, sd = sd_ii, lower.tail = FALSE), 4)
stopifnot(identical(px_ii, pz_ii))
#
cat(paste0("P(x >= ", x_ii, ") = ", format(px_ii, nsmall = 4), " (", 
           round(100* px_ii, 2), "%)\n"))
## P(x >= 40000) = 0.2420 (24.2%)
#
# #What should the guarantee mileage be if no more than 10% of the tires to be eligible 
# #for the discount guarantee i.e. left <10% # ~30100 miles
p_l_ii <- 0.10
p_r_ii <- 1 - p_l_ii
#
#qnorm(p = 0.10, mean = 36500, sd = 5000)
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
xz_ii <- z_ii * sd_ii + mu_ii
#
x_ii <- round(qnorm(p = p_l_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE), 4)
stopifnot(abs(xz_ii - x_ii) < 1)
cat(paste0("(Left) P(x) = ", p_l_ii, " (i.e. (Right) 1-P(z) = ", p_r_ii, 
           ") at x = ", round(x_ii, 1), "\n"))
## (Left) P(x) = 0.1 (i.e. (Right) 1-P(z) = 0.9) at x = 30092.2

Exercises

  • “ForLater”
    • Exercises
    • Normal Approximation of Binomial Probabilities
    • Exponential Probability Distribution
    • Relationship Between the Poisson and Exponential Distributions

Validation


22 Sampling Distributions

22.1 Overview

22.2 Definitions (Ref)

16.2 Elements are the entities on which data are collected. (Generally ROWS)

16.3 A variable is a characteristic of interest for the elements. (Generally COLUMNS)

16.20 A population is the set of all elements of interest in a particular study.

16.21 A sample is a subset of the population.

16.22 The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.

22.3 Sample

The sample contains only a portion of the population. Some sampling error is to be expected. So, the sample results provide only estimates of the values of the corresponding population characteristics.

Definition 22.1 The sampled population is the population from which the sample is drawn.
Definition 22.2 Frame is a list of the elements that the sample will be selected from.
Definition 22.3 The target population is the population we want to make inferences about. Generally (and preferably), it will be the same as the ‘Sampled Population,’ but the two may differ.
Definition 22.4 A simple random sample (SRS) is a set of \({k}\) objects from a population of \({N}\) objects, selected so that all possible samples are equally likely to be chosen. The number of such different simple random samples is \(C_k^N\).
Definition 22.5 Sampling without replacement: Once an element has been included in the sample, it is removed from the population and cannot be selected a second time.
Definition 22.6 Sampling with replacement: Once an element has been included in the sample, it is returned to the population. A previously selected element can be selected again and therefore may appear in the sample more than once.
  • Infinite Population
    • Sometimes the population is infinitely large or the elements of the population are being generated by an ongoing process for which there is no limit on the number of elements that can be generated.
    • Thus, it is not possible to develop a list of all the elements in the population. This is considered the infinite population case.
    • With an infinite population, we cannot select a ‘simple random sample’ because we cannot construct a frame consisting of all the elements.
    • In the infinite population case, statisticians recommend selecting what is called a ‘random sample.’
Definition 22.7 A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied. Each element selected comes from the same population. Each element is selected independently. The second condition prevents selection bias.
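Definitions 22.4–22.6 can be sketched in R: `choose(N, k)` counts the possible simple random samples, and `sample()` draws with or without replacement. A minimal illustration (the population values here are made up):

```r
# #Number of possible simple random samples of size k from N elements: C(N, k)
N_ii <- 10
k_ii <- 3
choose(N_ii, k_ii) #120
#
# #Sampling without replacement (each element selected at most once)
set.seed(42)
sample(x = 1:N_ii, size = k_ii, replace = FALSE)
#
# #Sampling with replacement (an element may appear more than once)
sample(x = 1:N_ii, size = k_ii, replace = TRUE)
```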

Random sample vs. SRS

  • Random sample: every element of the population has a (nonzero) probability of being drawn.
    • each element does not necessarily have an equal chance of being chosen.
  • SRS: every element of the population has the same (nonzero) probability of being drawn.
    • SRS is thus a special case of a random sample.
    • SRS is a subset of a statistical population in which each member of the subset has an equal probability of being chosen.
  • Elaboration of the Two conditions for Random Sample
    • Example: Consider a production line designed to fill boxes of a breakfast cereal.
      • Each element selected comes from the same population.
        • To ensure this, the boxes must be selected at approximately the same point in time.
        • This way the inspector avoids the possibility of selecting some boxes when the process is operating properly and other boxes when the process is not operating properly.
      • Each element is selected independently.
        • It is satisfied by designing the production process so that each box of cereal is filled independently.
    • Example: Consider the population of customers arriving at a fast-food restaurant.
      • McDonald’s implemented a random sampling procedure for this situation.
      • The sampling procedure was based on the fact that some customers presented discount coupons.
      • Whenever a customer presented a discount coupon, the next customer served was asked to complete a customer profile questionnaire. Because arriving customers presented discount coupons randomly and independently of other customers, this sampling procedure ensured that customers were selected independently.

22.4 Point Estimation

Definition 22.8 A population proportion \({P}\) is a parameter that describes a percentage value associated with a population. It is given by \(P = \frac{X}{N}\), where \({X}\) is the count of successes in the population, and \({N}\) is the size of the population. It is estimated through the sample proportion \(\overline{p} = \frac{x}{n}\), where \({x}\) is the count of successes in the sample, and \({n}\) is the size of the sample obtained from the population.
Definition 22.9 To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.
Definition 22.10 A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators for the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), and \(\rho_{xy}\) (correlation).
Definition 22.11 The numerical value obtained for the sample statistic is called the point estimate. ‘Estimate’ is used for the sample value only; the corresponding population value is the parameter. An estimate is a value, while an estimator is a function.
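As a minimal sketch (with simulated, illustrative data), each sample statistic below is a point estimator of the corresponding population parameter:

```r
# #Point estimates from a sample (illustrative data, true mu = 100, sigma = 15)
set.seed(1)
x_ii <- rnorm(n = 50, mean = 100, sd = 15)
mean(x_ii) #point estimate of mu
sd(x_ii)   #point estimate of sigma
var(x_ii)  #point estimate of sigma^2
```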

22.5 Sampling Distributions

Definition 22.12 The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).

Suppose, from a Population, we take a sample of size \({n}\) and calculate point estimate mean \(\overline{x}_{1}\). Further, we can select another random sample from the Population and get another point estimate mean \(\overline{x}_{2}\). If we repeat this process for 500 times, we will have a frame of \(\{\overline{x}_{1}, \overline{x}_{2}, \ldots, \overline{x}_{500}\}\).

If we consider the process of selecting a simple random sample as an experiment, the sample mean \({\overline{x}}\) is the numerical description of the outcome of the experiment. Thus, the sample mean \({\overline{x}}\) is a random variable. As a result, just like other random variables, \({\overline{x}}\) has a mean or expected value, a standard deviation, and a probability distribution. Because the various possible values of \({\overline{x}}\) are the result of different simple random samples, the probability distribution of \({\overline{x}}\) is called the sampling distribution of \({\overline{x}}\). Knowledge of this sampling distribution and its properties will enable us to make probability statements about how close the sample mean \({\overline{x}}\) is to the population mean \({\mu}\).

Just as with other probability distributions, the sampling distribution of \({\overline{x}}\) has an expected value or mean, a standard deviation, and a characteristic shape or form.
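The repeated-sampling idea above can be sketched by simulation: draw 500 samples of size \({n}\) from a population, collect the 500 sample means, and inspect their distribution (population parameters here are illustrative):

```r
# #Sampling distribution of the sample mean by simulation
set.seed(7)
mu_ii <- 50; sd_ii <- 10; n_ii <- 30
xbar_ii <- replicate(500, mean(rnorm(n = n_ii, mean = mu_ii, sd = sd_ii)))
#
mean(xbar_ii)      #close to mu
sd(xbar_ii)        #close to sigma/sqrt(n)
sd_ii / sqrt(n_ii) #theoretical standard error
```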

22.5.1 Mean

  • Expected Value of \({\overline{x}}\)
    • The mean of the \({\overline{x}}\) random variable is the expected value of \({\overline{x}}\).
    • Let \(E(\overline{x})\) represent the expected value of \({\overline{x}}\) and \({\mu}\) represent the mean of the population from which we are selecting a simple random sample. Then, \(E(\overline{x}) = \mu\)
    • When the expected value of a point estimator equals the population parameter, we say the point estimator is unbiased. Thus, \({\overline{x}}\) is an unbiased estimator of the population mean \({\mu}\).

22.5.2 Standard Deviation

Definition 22.13 In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’
  • Standard Deviation of \({\overline{x}}\), \(\sigma_{\overline{x}}\) is given by (22.1)
    • \(\sqrt{\frac{N - n}{N-1}}\) is commonly referred to as the finite population correction factor. With a large population, it approaches 1.
    • Thus, \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\) becomes good approximation when the sample size is less than or equal to 5% of the population size; that is, \(n/N \leq 0.05\).
    • To further emphasize the difference between \(\sigma_{\overline{x}}\) and \({\sigma}\), we refer to the standard deviation of \({\overline{x}}\), \(\sigma_{\overline{x}}\), as the standard error of the mean.
    • (Sampling Fluctuation) The standard error of the mean is helpful in determining how far the sample mean may be from the population mean.

\[\begin{align} \text{Finite Population:} \sigma_{\overline{x}} &= \sqrt{\frac{N - n}{N-1}}\left(\frac{\sigma}{\sqrt{n}} \right) \\ \text{Infinite Population:} \sigma_{\overline{x}} &= \frac{\sigma}{\sqrt{n}} \end{align} \tag{22.1}\]
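A sketch of equation (22.1): when \(n/N \leq 0.05\) the finite population correction factor is close to 1, so the simple formula is a good approximation (values below are illustrative):

```r
# #Standard error with and without the finite population correction
N_ii <- 10000; n_ii <- 100; sigma_ii <- 20
fpc_ii <- sqrt((N_ii - n_ii) / (N_ii - 1))
round(fpc_ii, 4) #close to 1, since n/N = 0.01 <= 0.05
#
se_finite_ii   <- fpc_ii * sigma_ii / sqrt(n_ii)
se_infinite_ii <- sigma_ii / sqrt(n_ii)
round(c(finite = se_finite_ii, infinite = se_infinite_ii), 4)
```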

Definition 22.14 A sampling error is the difference between a population parameter and a sample statistic.
  • Standard error is a measure of sampling error. There are others, but standard error is, by far, the most commonly used.
    • However, sampling error is NOT the only reason for a difference between the survey estimate and the true value in the population.
    • Another, and arguably more important, reason for this difference is bias.
      • Bias can be introduced when designing the sampling scheme.
      • Most forms of bias cannot be calculated nor measured after the data are collected, and are, therefore, often invisible.
      • Bias must be avoided by using correct procedures at each step of the survey process.
      • Bias has NOTHING to do with sample size which affects only sampling error and standard error.
      • As a result, large sample sizes do NOT eliminate bias. In fact, the larger sample size may increase the likelihood of bias in the data collection.

Refer Effect of Sample Size and Repeat Sampling


Figure 22.1 Effect of Sample Size vs Repeat Sampling

22.6 Synopsis

“ForLater”

If a statistically independent sample of \({n}\) observations \({{x}_1, {x}_2, \ldots, {x}_n}\) is taken from a statistical population with a standard deviation of \(\sigma\), then the mean value calculated from the sample \(\overline{x}\) will have an associated standard error of the mean \(\sigma_\overline{x}\) given by

\[\sigma_\overline{x} = \frac{\sigma}{\sqrt{n}} \tag{22.2}\]

The standard deviation \(\sigma\) of the population being sampled is seldom known. Therefore, \(\sigma_\overline{x}\) is usually estimated by replacing \(\sigma\) with the sample standard deviation \(\sigma_{x}\) instead:

\[\sigma_\overline{x} \approx \frac{\sigma_{x}}{\sqrt{n}} \tag{22.3}\]

As this is only an ‘estimator’ for the true “standard error,” other notations are used, such as:

\[\widehat{\sigma}_\overline{x} = \frac{\sigma_{x}}{\sqrt{n}} \tag{22.4}\]

OR:

\[{s}_\overline{x} = \frac{s}{\sqrt{n}} \tag{22.5}\]

Key:

  • \(\sigma\) : Standard deviation of the population
  • \(\sigma_{x}\) : Standard deviation of the sample
  • \(\sigma_\overline{x}\) : Standard deviation of the mean
    • the standard error
  • \(\widehat{\sigma}_\overline{x}\) : Estimator of the standard deviation of the mean
    • the most often calculated quantity
    • also often colloquially called the standard error

Non-mathematical view:

  • The SD (standard deviation) quantifies scatter — how much the values vary from one another.
  • The SEM (standard error of the mean) quantifies how precisely you know the true mean of the population.
    • It takes into account both the value of the SD and the sample size.
  • Both SD and SEM are in the same units i.e. the units of the data (in contrast, variance has squared units).
  • The SEM is, by definition, always smaller than the SD (it is the SD divided by \(\sqrt{n}\)).
    • The SEM gets smaller as your samples get larger.
    • So, the mean of a large sample is likely to be closer to the true population mean than is the mean of a small sample.
    • With a huge sample, you will know the value of the mean with a lot of precision even if the data is scattered.
  • The SD does not change predictably as you acquire more data.
    • The SD you compute from a sample is the best possible estimate of the SD of the overall population.
    • As you collect more data, you will assess the SD of the population with more precision. But you cannot predict whether the SD from a larger sample will be bigger or smaller than the SD from a small sample.
    • Technically, variance does not change predictably. Above is a simplification. For details, see Difference between SE and SD
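The contrast between SD and SEM can be sketched by simulation: as \({n}\) grows, the sample SD stays near \(\sigma\) while the SEM shrinks (population parameters are illustrative):

```r
# #SD stays near sigma; SEM = s/sqrt(n) shrinks as n grows
set.seed(11)
for (n_ii in c(10, 100, 1000)) {
  x_ii <- rnorm(n = n_ii, mean = 0, sd = 5)
  cat(sprintf("n = %4d | SD = %.3f | SEM = %.3f\n",
              n_ii, sd(x_ii), sd(x_ii) / sqrt(n_ii)))
}
```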

22.6.1 Form

Form of the Sampling Distribution of \({\overline{x}}\)

  • When the population has a normal distribution, the sampling distribution of \({\overline{x}}\) is normally distributed for any sample size.

  • When the population from which we are selecting a random sample does not have a normal distribution, the central limit theorem is helpful in identifying the shape of the sampling distribution of \({\overline{x}}\).

Definition 22.15 Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.

How large does the sample size need to be before the central limit theorem applies and we can assume that the shape of the sampling distribution is approximately normal?

  • For most applications, the sampling distribution of \({\overline{x}}\) can be approximated by a normal distribution whenever the sample is size 30 or more.
  • In cases where the population is highly skewed or outliers are present, samples of size 50 may be needed.
  • Finally, if the population is discrete, the sample size needed for a normal approximation often depends on the population proportion.
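The central limit theorem can be sketched by sampling from a highly skewed population, e.g. an exponential distribution: the distribution of the sample means is approximately normal even though the population is far from it (parameters are illustrative):

```r
# #CLT: means of samples from a skewed (exponential) population
set.seed(5)
n_ii <- 50
xbar_ii <- replicate(2000, mean(rexp(n = n_ii, rate = 1)))
#
mean(xbar_ii) #close to the population mean 1/rate = 1
sd(xbar_ii)   #close to (1/rate)/sqrt(n)
#
# #Central 95% of the sample means, roughly 1 +/- 1.96/sqrt(n)
quantile(xbar_ii, probs = c(0.025, 0.975))
```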

Ex: EAI

Task: develop a profile of 2500 managers. The characteristics to be identified include the mean annual salary for the managers and the proportion of managers who have completed a training program.

  • Population
    • Population Size N = 2500 managers
    • Training: 1500/2500 managers have completed Training
    • Salary: \({\mathcal{N}}_{(\mu = 51800, \, \sigma = 4000)}\)
    • Proportion of the population that completed the training program \(p = \frac{1500}{2500} = 0.60\)
  • Suppose that a sample of 30 managers will be used, i.e. \({n=30}\), with 19 Yes for Training
    • Suppose the sample yields \(\overline{x} = 51814, \; s = 3348\)
    • Also, \(\overline{p} = \frac{x}{n} = \frac{19}{30} = 0.63\)
  • If 500 such samples are taken, where each have their own \({\overline{x}}\)
    • Then their expected value \(E(\overline{x}) = \mu = 51800\)
    • Standard Error \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4000}{\sqrt{30}} = 730.3\)
  • Suppose the director believes the sample mean \({\overline{x}}\) will be an acceptable estimate of the population mean \({\mu}\) if the sample mean is within 500 dollars of the population mean.
    • However, it is not possible to guarantee that the sample mean will be within 500 dollars of the population mean
    • We can reframe the request in probability terms i.e.
      • What is the probability that the sample mean computed using a simple random sample of 30 EAI managers will be within 500 dollars of the population mean
      • i.e. Probability that \(\overline{x} \in [51300, 52300]\)
        • For \(z = \frac{\overline{x} - \mu}{\sigma_{\overline{x}}}\)
        • For \(\overline{x} = 52300 \Rightarrow z = \frac{52300 - 51800}{730.30} = 0.68\)
        • For \(\overline{x} = 51300 \Rightarrow z = \frac{51300 - 51800}{730.30} = -0.68\)
    • \(P_{(51300 \leq \overline{x} \leq 52300)} = P_{(\overline{x} \leq 52300)} - P_{(\overline{x} \leq 51300)} = P_{(z \leq 0.68)} - P_{(z \leq -0.68)} = 0.7517 - 0.2483 = 0.5034\)
      • A simple random sample of 30 EAI managers has a 0.5034 probability of providing a sample mean \({\overline{x}}\) that is within 500 dollars of the population mean.
        • Thus, there is a \(1 − 0.5034 = 0.4966\) probability that the difference between \({\overline{x}}\) and \({\mu}\) will be more than 500 dollars.
        • In other words, a simple random sample of 30 EAI managers has roughly a 50–50 chance of providing a sample mean within the allowable 500 dollars. Perhaps a larger sample size should be considered.
        • Let us explore this possibility by considering the relationship between the sample size and the sampling distribution of \({\overline{x}}\).
  • Impact of \(n = 100\) in place of \(n =30\)
    • First note that \(E(\overline{x}) = \mu\) regardless of the sample size. Thus, the mean of all possible values of \({\overline{x}}\) is equal to the population mean \({\mu}\) regardless of the sample size \({n}\).
    • However, standard error is reduced to \(\sigma_{\overline{x}} = \frac{4000}{\sqrt{100}} = 400\)
    • For \(\overline{x} = 52300 \Rightarrow z = \frac{52300 - 51800}{400} = 1.25\)
    • For \(\overline{x} = 51300 \Rightarrow z = \frac{51300 - 51800}{400} = -1.25\)
    • Thus \(P_{(51300 \leq \overline{x} \leq 52300)} = P_{(\overline{x} \leq 52300)} - P_{(\overline{x} \leq 51300)} = P_{(z \leq 1.25)} - P_{(z \leq -1.25)} = 0.8944 - 0.1056 = 0.7888\)
    • Thus, by increasing the sample size from 30 to 100 EAI managers, we increase the probability of obtaining a sample mean within 500 dollars of the population mean from 0.5034 to 0.7888.
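The EAI probabilities above can be reproduced with `pnorm()` applied to \({\overline{x}}\) directly, using mean = \(\mu\) and sd = \(\sigma/\sqrt{n}\). The exact values differ slightly from the table-based 0.5034 and 0.7888 because z is not rounded to two decimals:

```r
# #EAI: P(51300 <= xbar <= 52300) for n = 30 and n = 100
mu_ii <- 51800; sigma_ii <- 4000
for (n_ii in c(30, 100)) {
  se_ii <- sigma_ii / sqrt(n_ii)
  p_ii <- pnorm(q = 52300, mean = mu_ii, sd = se_ii) -
    pnorm(q = 51300, mean = mu_ii, sd = se_ii)
  cat(paste0("n = ", n_ii, ": P(51300 <= xbar <= 52300) = ", round(p_ii, 4), "\n"))
}
```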

Caution: Here, we took advantage of the fact that the population mean \({\mu}\) and the population standard deviation \({\sigma}\) were known. However, usually these values will be unknown.

“ForLater”

Properties of Point Estimators

Three properties of good point estimators: unbiasedness, efficiency, and consistency.

\(\theta = \text{the population parameter of interest}\) \(\hat{\theta} = \text{the sample statistic or point estimator of } \theta\)

  • Unbiased
    • If the expected value of the sample statistic is equal to the population parameter being estimated, the sample statistic is said to be an unbiased estimator of the population parameter
  • Efficiency
    • When sampling from a normal population, the standard error of the sample mean is less than the standard error of the sample median. Thus, the sample mean is more efficient than the sample median.
  • Consistency
    • A point estimator is consistent if the values of the point estimator tend to become closer to the population parameter as the sample size becomes larger.
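The efficiency claim can be sketched by simulation: for samples from a normal population, the sample means vary less than the sample medians (parameters are illustrative):

```r
# #Efficiency: standard error of the mean vs the median, normal population
set.seed(3)
n_ii <- 25
means_ii   <- replicate(2000, mean(rnorm(n_ii)))
medians_ii <- replicate(2000, median(rnorm(n_ii)))
#
round(c(se_mean = sd(means_ii), se_median = sd(medians_ii)), 4)
# #se_mean < se_median, so the sample mean is the more efficient estimator
```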

Other Sampling Methods

  • Stratified Random Sampling
  • Cluster Sampling
  • Systematic Sampling
  • Convenience Sampling
  • Judgment Sampling

Validation


23 Interval Estimation

23.1 Overview

23.2 Interval Estimate

Definition 23.1 Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)
Definition 23.2 Confidence interval is another name for an interval estimate. Normally it is given as \(({\gamma} = 1 - {\alpha})\). Ex: 95% confidence interval
Definition 23.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

Known SD

In order to develop an interval estimate of a population mean, either the population standard deviation \({\sigma}\) or the sample standard deviation \({s}\) must be used to compute the margin of error. In most applications \({\sigma}\) is not known, and \({s}\) is used to compute the margin of error.

In some applications, large amounts of relevant historical data are available and can be used to estimate the population standard deviation prior to sampling. Also, in quality control applications where a process is assumed to be operating correctly, or ‘in control,’ it is appropriate to treat the population standard deviation as known.

Sampling distribution of \({\overline{x}}\) can be used to compute the probability that \({\overline{x}}\) will be within a given distance of \({\mu}\).

Example: Lloyd Department Store

  • Each week Lloyd Department Store selects a simple random sample of 100 customers in order to learn about the amount spent per shopping trip.
    • With \({x}\) representing the amount spent per shopping trip, the sample mean \({\overline{x}}\) provides a point estimate of \({\mu}\), the mean amount spent per shopping trip for the population of all Lloyd customers. Based on the historical data, Lloyd now assumes a known value of \(\sigma = 20\) for the population standard deviation.
    • During the most recent week, Lloyd surveyed 100 customers \((n = 100)\) and obtained a sample mean of \(\overline{x} = 82\).
    • we can conclude that the sampling distribution of \({\overline{x}}\) follows a normal distribution with a standard error of \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} = \frac{20}{\sqrt{100}} =2\).
    • Because the sampling distribution shows how values of \({\overline{x}}\) are distributed around the population mean \({\mu}\), the sampling distribution of \({\overline{x}}\) provides information about the possible differences between \({\overline{x}}\) and \({\mu}\).
    • Using the standard normal probability table, we find that 95% of the values of any normally distributed random variable are within \(\pm 1.96\) standard deviations of the mean i.e. \([\mu - 1.96 \sigma, \mu + 1.96\sigma]\).
      • Thus, 95% of the \({\overline{x}}\) values must be within \(\pm 1.96 \sigma_{\overline{x}}\) of the mean \({\mu}\).
      • In the Lloyd example we know that the sampling distribution of \({\overline{x}}\) is normally distributed with a standard error of \(\sigma_{\overline{x}} =2\).
      • we can conclude that 95% of all \({\overline{x}}\) values obtained using a sample size of \(n = 100\) will be within \((\pm 1.96 \times 2 = \pm 3.92)\) of the population mean \({\mu}\).
    • As given above, sample mean was \(\overline{x} = 82\)
      • Interval estimate of \(\overline{x} = 82 \pm 3.92 = [78.08, 85.92]\)
      • Because 95% of all the intervals constructed using \(\overline{x} = 82 \pm 3.92\) will contain the population mean, we say that we are 95% confident that the interval 78.08 to 85.92 includes the population mean \({\mu}\).
      • We say that this interval has been established at the 95% confidence level.
      • The value 0.95 is referred to as the confidence coefficient, and the interval 78.08 to 85.92 is called the 95% confidence interval.
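The Lloyd interval can be computed in R; because \(\sigma\) is known, `qnorm()` gives the critical value:

```r
# #Lloyd: 95% interval estimate of the population mean, sigma known
xbar_ii <- 82; sigma_ii <- 20; n_ii <- 100
alpha_ii <- 0.05
z_ii <- qnorm(p = 1 - alpha_ii/2) #1.96
moe_ii <- z_ii * sigma_ii / sqrt(n_ii)
cat(paste0("95% CI: [", round(xbar_ii - moe_ii, 2), ", ",
           round(xbar_ii + moe_ii, 2), "]\n"))
## 95% CI: [78.08, 85.92]
```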

Interval Estimate of a Population Mean: \({\sigma}\) known is given by equation (23.1)

\[\begin{align} \overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \end{align} \tag{23.1}\]

where \((1 − \alpha)\) is the confidence coefficient and \(z_{\alpha/2}\) is the z-value providing an area of \(\alpha/2\) in the upper tail of the standard normal probability distribution.

For a 95% confidence interval, the confidence coefficient is \((1 − \alpha) = 0.95\) and thus, \(\alpha = 0.05\). Using the standard normal probability table, an area of \(\alpha/2 = 0.05/2 = 0.025\) in the upper tail provides \(z_{.025} = 1.96\).

# #Find z-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ", 
           format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.975 (i.e. (Right) 1-P(z) = 0.025) at z = 1.96
#
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
#
# #Left Tail Test
round(qnorm(p = xxalpha, lower.tail = TRUE), 4)
##     10%      5%    5/2%      1%    1/2% 
## -1.2816 -1.6449 -1.9600 -2.3263 -2.5758
#
# #Right Tail Test
round(qnorm(p = xxalpha, lower.tail = FALSE), 4)
##    10%     5%   5/2%     1%   1/2% 
## 1.2816 1.6449 1.9600 2.3263 2.5758

23.3 Unknown SD

Definition 23.4 When \({s}\) is used to estimate \({\sigma}\), the margin of error and the interval estimate for the population mean are based on a probability distribution known as the t distribution.

The t distribution is a family of similar probability distributions, with a specific t distribution depending on a parameter known as the degrees of freedom. As the number of degrees of freedom increases, the difference between the t distribution and the standard normal distribution becomes smaller and smaller.

Just as \(z_{0.025}\) was used to indicate the z value providing a 0.025 area in the upper tail of a standard normal distribution, \(t_{0.025}\) indicates a 0.025 area in the upper tail of a t distribution. In general, the notation \(t_{\alpha/2}\) represents a t value with an area of \(\alpha/2\) in the upper tail of the t distribution.

As the degrees of freedom increase, the t distribution approaches the standard normal distribution. Ex: \(t_{0.025} = 2.262 \, (\text{DOF} = 9)\), \(t_{0.025} = 2.000 \, (\text{DOF} = 60)\), and \(t_{0.025} = 1.96 \, (\text{DOF} = \infty) = z_{0.025}\)

Interval Estimate of a Population Mean: \({\sigma}\) Unknown is given by equation (23.2)

\[\begin{align} \overline{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \end{align} \tag{23.2}\]

where \({s}\) is the sample standard deviation, \((1 − \alpha)\) is the confidence coefficient and \(t_{\alpha/2}\) is the t-value providing an area of \(\alpha/2\) in the upper tail of the t distribution with \({n-1}\) degrees of freedom.

Refer equation (18.12), the expression for the sample standard deviation is

\[{s} = \sqrt{\frac{\sum \left(x_i - \overline{x}\right)^2}{n-1}}\]

Definition 23.5 The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, the degrees of freedom of an estimate of a parameter are \((n - 1)\).

Why \((n-1)\) are the degrees of freedom

  • Degrees of freedom refer to the number of independent pieces of information that go into the computation. i.e. \(\{(x_{1}-\overline{x}), (x_{2}-\overline{x}), \ldots, (x_{n}-\overline{x})\}\)
  • However, \(\sum (x_{i}-\overline{x}) = 0\) for any data set.
  • Thus, only \((n − 1)\) of the \((x_{i}-\overline{x})\) values are independent.
    • if we know \((n − 1)\) of the values, the remaining value can be determined exactly by using the condition.
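The constraint \(\sum (x_{i}-\overline{x}) = 0\) can be checked directly; given \((n-1)\) deviations, the last one is determined (data below are illustrative):

```r
# #Deviations from the mean always sum to zero
x_ii <- c(4, 8, 15, 16, 23, 42)
dev_ii <- x_ii - mean(x_ii)
round(sum(dev_ii), 10) #0
#
# #So the last deviation equals minus the sum of the first n-1
n_ii <- length(x_ii)
dev_ii[n_ii]       #24
-sum(dev_ii[-n_ii]) #24
```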

Larger sample sizes are needed if the distribution of the population is highly skewed or includes outliers.

Cal T

# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
# #Find t-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
# #t-tables are unique for different degrees of freedom i.e. for DOF = 9 
dof_ii <- 9
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.2622 (dof = 9)

qt()

# #Same setup as above (p_l_ii = 0.975); vary the degrees of freedom
#
dof_ii <- 60
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.0003 (dof = 60)
#
dof_ii <- 600
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.9639 (dof = 600)
#
# #t-tables have an Infinity row, which matches the z-table. For DOF > 100, it can be used.
dof_ii <- Inf
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.96 (dof = Inf)

#
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ", 
           format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.975 (i.e. (Right) 1-P(z) = 0.025) at z = 1.96

Ex: Credit Card

# #A sample of n = 70 households provided the credit card balances.
xxCreditCards <- c(9430, 7535, 4078, 5604, 5179, 4416, 10676, 1627, 10112, 6567, 13627, 18719, 14661, 12195, 10544, 13659, 7061, 6245, 13021, 9719, 2200, 10746, 12744, 5742, 7159, 8137, 9467, 12595, 7917, 11346, 12806, 4972, 11356, 7117, 9465, 19263, 9071, 3603, 16804, 13479, 14044, 6817, 6845, 10493, 615, 13627, 12557, 6232, 9691, 11448, 8279, 5649, 11298, 4353, 3467, 6191, 12851, 5337, 8372, 7445, 11032, 6525, 5239, 6195, 12584, 15415, 15917, 12591, 9743, 10324)
f_setRDS(xxCreditCards)
bb <- f_getRDS(xxCreditCards)
mean_bb <- mean(bb)
sd_bb <- sd(bb)
dof_bb <- length(bb) - 1L
# #t-value for confidence interval 95% | (1-alpha) = 0.95 | alpha = 0.05 | alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
dof_ii <- dof_bb
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.9949 (dof = 69)
#
# #Interval Estimate
err_margin_bb <- t_ii * sd_bb / sqrt(length(bb))
est_l <- mean_bb - err_margin_bb
est_r <- mean_bb + err_margin_bb
#
cat(paste0("Normal Sample (n=", length(bb), ", mean=", mean_bb, ", sd=", round(sd_bb, 1),
           "):\n Point Estimate = ", mean_bb, ", Margin of error = ", round(err_margin_bb, 1), 
           ", ", (1-2*p_r_ii) * 100, "% confidence interval is [", 
           round(est_l, 1), ", ", round(est_r, 1), "]"))
## Normal Sample (n=70, mean=9312, sd=4007):
##  Point Estimate = 9312, Margin of error = 955.4, 95% confidence interval is [8356.6, 10267.4]

“ForLater”

  • Determining the Sample Size
  • Population Proportion

Validation


24 Hypothesis Tests

24.1 Overview

24.2 Hypothesis Testing

Definition 24.1 Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.

Note:

  • Hypothesis testing is used to determine whether a statement about the value of a population parameter should or should not be rejected.
  • It is the process of checking whether the sample information is consistent with the population information.
  • The hypothesis testing procedure uses data from a sample to test the two competing statements, indicated by \({H_0}\) and \({H_a}\).
Definition 24.2 Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.
Definition 24.3 Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.

Note:

  • The conclusion that the alternative hypothesis \((H_a)\) is true is made if the sample data provide sufficient evidence to show that the null hypothesis \((H_0)\) can be rejected.
  • The null and alternative hypotheses are competing statements about the population. Either the null hypothesis \({H_0}\) is true or the alternative hypothesis \({H_a}\) is true, but not both.

24.3 Developing Null and Alternative Hypotheses

All hypothesis testing applications involve collecting a sample and using the sample results to provide evidence for drawing a conclusion.

In some situations it is easier to identify the alternative hypothesis first and then develop the null hypothesis.

  • The Alternative Hypothesis as a Research Hypothesis
    • A new fuel injection system is designed to increase the miles-per-gallon rating above the current value of 24 miles per gallon.
      • \(H_a : \mu > 24 \iff H_0: \mu \leq 24\)
    • A new teaching method is developed that is believed to be better than the current method.
      • \(H_a : \text{\{New method is better\}} \iff H_0: \text{\{New method is NOT better\}}\)
    • A new sales force bonus plan is developed in an attempt to increase sales.
      • \(H_a : \text{\{New plan increases sales\}} \iff H_0: \text{\{New plan does not increase sales\}}\)
    • A new drug is developed with the goal of lowering blood pressure more than an existing drug.
      • \({H_a}\) : New drug lowers blood pressure more than the existing drug
      • \({H_0}\) : New drug does not provide lower blood pressure than the existing drug
    • In each case, rejection of the null hypothesis \({H_0}\) provides statistical support for the research hypothesis \({H_a}\).
  • The Null Hypothesis as an Assumption to Be Challenged
    • The null hypothesis \({H_0}\) expresses the belief or assumption about the value of the population parameter. The alternative hypothesis \({H_a}\) is that the belief or assumption is incorrect.
    • Ex: The label on a soft drink bottle states that it contains 67.6 fluid ounces.
      • We consider the label correct provided the population mean filling weight for the bottles is at least 67.6 fluid ounces.
      • Without any reason to believe otherwise, we would give the manufacturer the benefit of the doubt and assume that the statement provided on the label is correct.
      • \(H_0 : \mu \geq 67.6 \iff H_a: \mu < 67.6\)
      • If the sample results lead to the conclusion to reject \({H_0}\), the inference that \(H_a: \mu < 67.6\) is true can be made. With this statistical support, the agency is justified in concluding that the label is incorrect and underfilling of the bottles is occurring. Appropriate action to force the manufacturer to comply with labeling standards would be considered.
      • However, if the sample results indicate \({H_0}\) cannot be rejected, the assumption that the labeling is correct cannot be rejected. With this conclusion, no action would be taken.
      • Product information is usually assumed to be true and is stated as the null hypothesis. The conclusion that the information is incorrect can be made if the null hypothesis is rejected.
    • Same situation, from the point of view of the manufacturer
      • The company does not want to underfill the containers (legal requirement). However, the company does not want to overfill containers either because it would be an unnecessary cost.
      • \(H_0 : \mu = 67.6 \iff H_a: \mu \neq 67.6\)
      • If the sample results lead to the conclusion to reject \({H_0}\), the inference is made that \(H_a: \mu \neq 67.6\) is true. We conclude that the bottles are not being filled properly and the production process should be adjusted.
      • However, if the sample results indicate \({H_0}\) cannot be rejected, the assumption that the process is functioning properly cannot be rejected. In this case, no further action would be taken.

24.4 Three forms of hypotheses

For hypothesis tests involving a population mean, we let \({\mu}_0\) denote the hypothesized value and we must choose one of the following three forms for the hypothesis test.

The alternative is one-sided if it states that the parameter is larger or smaller than the null value. The alternative is two-sided if it states that the parameter is different from the null value.

Definition 24.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)
Definition 24.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
Definition 24.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

Refer Equality in Hypothesis

Exercises

  • The manager of an automobile dealership is considering a new bonus plan designed to increase sales volume. Currently, the mean sales volume is 14 automobiles per month. The manager wants to conduct a research study to see whether the new bonus plan increases sales volume.
    • Solution: \(H_0 : \mu \leq 14 \iff H_a: \mu > 14\)
  • A director of manufacturing must convince management that a proposed manufacturing method reduces costs before the new method can be implemented. The current production method operates with a mean cost of 220 dollars per hour.
    • Solution: \(H_0 : \mu \geq 220 \iff H_a: \mu < 220\)

24.5 Type I and Type II Errors

Refer Type I and Type II Errors (B12)

Ideally the hypothesis testing procedure should lead to the acceptance of \({H_0}\) when \({H_0}\) is true and the rejection of \({H_0}\) when \({H_a}\) is true. Unfortunately, the correct conclusions are not always possible. Because hypothesis tests are based on sample information, we must allow for the possibility of errors.


Figure 24.1 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors

Definition 24.7 The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).
Definition 24.8 The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).
Definition 24.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.

23.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting \({\alpha}\), that person is controlling the probability of making a Type I error.

  • The most common values are \({\alpha} = 0.05\) and \({\alpha} = 0.01\).
    • For example, a significance level of \({\alpha} = 0.05\) indicates a 5% risk of concluding that a difference exists when there is no actual difference.
    • Lower significance levels indicate that you require stronger evidence before you will reject the null hypothesis.
    • If the cost of making a Type I error is high, small values of \({\alpha}\) are preferred. Ex: \({\alpha} = 0.01\)
    • If the cost of making a Type I error is not too high, larger values of \({\alpha}\) are typically used. Ex: \(\alpha = 0.05\)
Definition 24.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.

Although most applications of hypothesis testing control for the probability of making a Type I error, they do not always control for the probability of making a Type II error. Hence, if we decide to accept \({H_0}\), we cannot determine how confident we can be with that decision. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement “do not reject \({H_0}\)” instead of “accept \({H_0}\).” Using the statement “do not reject \({H_0}\)” carries the recommendation to withhold both judgment and action. In effect, by not directly accepting \({H_0}\), the statistician avoids the risk of making a Type II error.

24.5.1 Additional

Refer figure 24.1

  1. Type I (\({\alpha}\)):
    • False Positive: Rejecting a True \({H_0}\) thus claiming False \({H_a}\)
    • An alpha error is when you mistakenly reject the Null and believe that something significant happened
      • i.e. you believe that the means of the two populations are different when they are not
      • i.e. you report that your findings are significant when in fact they have occurred by chance
    • The probability of making a type I error is represented by alpha level \({\alpha}\), which is the p-value below which you reject the null hypothesis
      • The p-value is the actual risk you have in being wrong if you reject the null
        • You would like that to be low
        • This p-value is compared with and should be lower than the alpha
        • A p-value of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis. You can reduce your risk of committing a type I error by using a lower value for p. For example, a p-value of 0.01 would mean there is a 1% chance of committing a Type I error.
        • However, using a lower value for alpha means that you will be less likely to detect a true difference if one really exists (thus risking a type II error).
    • \({\alpha}\) is Significance Level (for \((1-{\alpha})\) confidence of not committing Type 1 error)
      • It is the boundary for specifying a statistically significant finding when interpreting the p-value
    • NOTE: Fail to reject True \({H_0}\) (\(\approx\) accept) is the correct decision shown in Top Left Quadrant
  2. Type II (\({\beta}\)):
    • False Negative: Failing to reject (\(\approx\) accept) a False \({H_0}\)
    • A beta error is when you fail to reject the null when you should have
      • i.e. you missed something significant and failed to take action
      • i.e. you conclude that there is not a significant effect, when actually there really is
      • You can decrease your risk of committing a type II error by ensuring your test has enough power.
      • You can do this by ensuring your sample size is large enough to detect a practical difference when one truly exists.

24.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).

  • The consequences of making a type I error mean that changes or interventions are made which are unnecessary, and thus waste time, resources, etc.
  • Type II errors typically lead to the preservation of the status quo (i.e. interventions remain the same) when change is needed.
  • Generally, at most 5% \({\alpha}\) and at most 20% \({\beta}\) error rates are recommended
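A small simulation makes the meaning of \({\alpha}\) concrete: when \({H_0}\) is true, a two-tailed z-test run at \({\alpha} = 0.05\) commits a Type I error in roughly 5% of repeated samples. A sketch, assuming a known \({\sigma} = 1\) and a true mean of 0:

```r
# #Simulate the Type I error rate of a two-tailed z-test (H0 true: mu = 0)
set.seed(1)
alpha  <- 0.05
n      <- 30
z_crit <- qnorm(1 - alpha / 2)            # about 1.96 for alpha = 0.05

rejects <- replicate(10000, {
  x <- rnorm(n, mean = 0, sd = 1)         # sample drawn with H0 true
  z <- mean(x) / (1 / sqrt(n))            # test statistic, known sigma = 1
  abs(z) >= z_crit                        # TRUE if H0 is (wrongly) rejected
})
mean(rejects)                             # close to alpha = 0.05
```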

24.6 Known SD

24.6.1 Test Statistic

Definition 24.11 Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.

20.4 The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.

The test statistic summarizes the observed data into a single number using the central tendency, variation, sample size, and number of predictor variables in the statistical model. Refer Table 24.1

Table 24.1: (C09V01) Test Statistic
  • t-value
    • Null: The means of two groups are equal
    • Alternative: The means of two groups are not equal
    • Tests that use it: T-test, Regression tests
  • z-value
    • Null: The means of two groups are equal
    • Alternative: The means of two groups are not equal
    • Tests that use it: Z-test
  • F-value
    • Null: The variation among two or more groups is greater than or equal to the variation between the groups
    • Alternative: The variation among the groups is smaller than the variation between the groups
    • Tests that use it: ANOVA, ANCOVA, MANOVA
  • \({\chi}^2\text{-value}\)
    • Null: Two samples are independent
    • Alternative: Two samples are not independent (i.e. they are correlated)
    • Tests that use it: Chi-squared test, Non-parametric correlation tests
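For reference, base R exposes most of the statistics in Table 24.1 directly (the toy data below are illustrative; note that base R has no built-in z-test function, so the z-value is computed by hand):

```r
# #Base R functions that report each test statistic (toy data)
set.seed(7)
x <- rnorm(20, mean = 5)
y <- rnorm(20, mean = 5.5)
g <- gl(2, 20)                                    # two-level grouping factor

t.test(x, y)$statistic                            # t-value
anova(aov(c(x, y) ~ g))[["F value"]][1]           # F-value
chisq.test(matrix(c(12, 8, 9, 11), nrow = 2))$statistic  # chi-squared value
# #Base R has no z.test(); compute z directly when sigma is known (assumed = 1):
(mean(x) - 5) / (1 / sqrt(length(x)))             # z-value
```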

24.6.2 Tails

18.17 A tail refers to the tapering sides at either end of a distribution curve.

Definition 24.12 A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.

One-tailed tests are concerned with one side of a statistic: they deal with only one tail of the distribution, and the z-score lies on only one side of the statistic. Two-tailed tests deal with both tails of the distribution, and the z-score may lie on either side of the statistic.

In a one-tailed test, the area under the rejection region is equal to the level of significance, \({\alpha}\). When the rejection region is below the acceptance region, we say that it is a left-tail test. Similarly, when the rejection region is above the acceptance region, we say that it is a right-tail test.

In the two-tailed test, there are two critical regions, and the area under each region is \(\frac{\alpha}{2}\).

One-Tail vs. Two-Tail

  • One-tailed tests have more statistical power to detect an effect in one direction than a two-tailed test with the same design and significance level.
    • One-tailed tests occur most frequently for studies where one of the following is true:
      • Effects can exist in only one direction.
      • Effects can exist in both directions but the researchers only care about an effect in one direction.
  • The disadvantage of one-tailed tests is that they have no statistical power to detect an effect in the other direction.
    • In contrast, a two-tailed hypothesis test is designed to show whether the sample mean is significantly greater than OR significantly less than the mean of a population.
      • A two-tailed test is designed to examine both sides of a specified data range as designated by the probability distribution involved.
  • Rule of thumb
    • Consider both directions when deciding whether to run a one-tailed or a two-tailed test. If you can skip one tail, and it is not irresponsible or unethical to do so, then you can run a one-tailed test.
    • Run a two-tailed test when you do not know the direction of the effect, so that both sides are tested.

24.6.3 One-tailed Test

One-tailed tests about a population mean take one of the following two forms:

24.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)

24.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

Definition 24.13 One-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution.

Example: The label on a can of Hilltop Coffee states that the can contains 3 pounds of coffee. As long as the population mean filling weight is at least 3 pounds per can, the rights of consumers will be protected. Thus, the government (FTC) interprets the label information on a large can of coffee as a claim by Hilltop that the population mean filling weight is at least 3 pounds per can.

  • Develop the null and alternative hypotheses for the test
    • \(H_0 : \mu \geq 3 \iff H_a: \mu < 3\)
  • Take a Sample
    • Suppose a sample of 36 cans of coffee is selected and the sample mean \({\overline{x}}\) is computed as an estimate of the population mean \({\mu}\). If the value of the sample mean \({\overline{x}}\) is less than 3 pounds, the sample results will cast doubt on the null hypothesis.
    • What we want to know is how much less than 3 pounds must \({\overline{x}}\) be before we would be willing to declare the difference significant and risk making a Type I error by falsely accusing Hilltop of a label violation. A key factor in addressing this issue is the value the decision maker selects for the level of significance.
  • Specify the level of significance \({\alpha}\)
    • FTC is willing to risk a 1% chance of making such an error, i.e. \(\alpha = 0.01\)
  • Compute the value of test statistic
    • Assume, known \({\sigma} = 0.18\) and Normal distribution
    • Refer equation (22.1), standard error of \({\overline{x}}\) is \({\sigma}_{\overline{x}} = \frac{{\sigma}}{\sqrt{n}} = \frac{0.18}{\sqrt{36}} = 0.03\)
    • Because the sampling distribution of \({\overline{x}}\) is normally distributed, \(z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - 3}{0.03}\)
    • Because the sampling distribution of \({\overline{x}}\) is normally distributed, the sampling distribution of \({z}\) is a standard normal distribution.
    • A value of \(z = −1\) means that the value of \({\overline{x}}\) is one standard error below the hypothesized value of the mean. For a value of \(z = −2\), it would be two standard errors below the mean, and so on.
    • We can use the standard normal probability table to find the lower tail probability \({P_{\left(z\right)}}\) corresponding to any \({z}\) value. Refer Get P(z) by pnorm() or z by qnorm()
      • Ex: \(P_{\left(z = -3\right)} = 0.0013\)
      • As a result, the probability of obtaining a value of \({\overline{x}}\) that is 3 or more standard errors below the hypothesized population mean \({\mu}_0 = 3\) is also 0.0013. i.e. Such a result is unlikely if the null hypothesis is true.
Definition 24.14 If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (24.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)

\[z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}} \tag{24.1}\]

The key question for a lower tail test is: How small must the test statistic \({z}\) be before we choose to reject the null hypothesis?

Two approaches can be used to answer this: the p-value approach and the critical value approach.

24.6.3.1 p-value approach

Definition 24.15 The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.
Definition 24.16 A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).

p-value (p) is the probability of obtaining a result equal to or more extreme than was observed in the data. It is the probability of observing the result given that the null hypothesis is true. A small p-value indicates the value of the test statistic is unusual given the assumption that \({H_0}\) is true.

For a lower tail test, the p-value is the probability of obtaining a value for the test statistic as small as or smaller than that provided by the sample.

  • We use the standard normal distribution to find the probability that \({z}\) is less than or equal to the value of the test statistic.
  • After computing the p-value, we must decide whether it is small enough to reject the null hypothesis; this decision involves comparing the p-value to the level of significance.

For the Hilltop Coffee Example

  • Suppose the sample of 36 Hilltop coffee cans provides a sample mean of \({\overline{x}}\) = 2.92 pounds.
    • Is \(\overline{x} = 2.92\) small enough to cause us to reject \({H_0}\)?
  • Because this is a lower tail test, the p-value is the area under the standard normal curve for values of \({z}\) less than or equal to the value of the test statistic.
    • Refer equation (24.1), \(z = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}} = \frac{2.92 - 3}{0.18/\sqrt{36}} = -2.67\)
    • Thus, the p-value is the probability that \({z}\) is less than or equal to −2.67 (the lower tail area corresponding to the value of the test statistic).
  • Refer Get P(z) by pnorm() or z by qnorm(), to get the p-value
    • \(P_{\left(\overline{x} = 2.92\right)} = P_{\left(z = -2.67\right)} = 0.0038\)
    • This p-value does not provide much support for the null hypothesis, but is it small enough to cause us to reject \({H_0}\)?
  • Compare p-value with Level of significance \(\alpha = 0.01\)
    • Because .0038 is less than or equal to \(\alpha = 0.01\), we reject \({H_0}\). Therefore, we find sufficient statistical evidence to reject the null hypothesis at the .01 level of significance.
    • We can conclude that Hilltop is underfilling the cans.

Rejection Rule: Reject \({H_0}\) if p-value \(\leq {\alpha}\)

Further, in this case, we would reject \({H_0}\) for any value of \({\alpha} \geq (p = 0.0038)\). For this reason, the p-value is also called the observed level of significance.
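The Hilltop p-value computation maps directly onto `pnorm()`; the numbers below mirror the text:

```r
# #Hilltop Coffee: lower tail test, sigma known
mu0 <- 3; sigma <- 0.18; n <- 36; xbar <- 2.92; alpha <- 0.01

z <- (xbar - mu0) / (sigma / sqrt(n))   # test statistic, equation (24.1)
round(z, 2)                             # -2.67
p_value <- pnorm(z)                     # lower tail area at z
round(p_value, 4)                       # 0.0038
p_value <= alpha                        # TRUE -> reject H0
```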

  • (Aside)
    • For the p-value approach, the likelihood (p-value) of the numerical value of the test statistic is compared to the specified significance level (\({\alpha}\)) of the hypothesis test.
    • The p-value corresponds to the probability of observing sample data at least as extreme as the actually obtained test statistic. Small p-values provide evidence against the null hypothesis. The smaller (closer to 0) the p-value, the stronger is the evidence against the null hypothesis.
    • “If the null hypothesis is true, what is the probability that we would observe a more extreme test statistic in the direction of the alternative hypothesis than we did”
    • Ex: (criminal trials) “If the defendant is innocent, what is the chance that we would observe such extreme criminal evidence”
    • pnorm() returns the cumulative probability up to q (i.e. \({\overline{x}}\)) for a normal distribution with a given mean \({\mu}\) and standard deviation \({\sigma}\).

24.6.3.2 Critical value approach

Definition 24.17 The critical value approach requires that we first determine a value for the test statistic called the critical value.
Definition 24.18 Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).

For a lower tail test, the critical value serves as a benchmark for determining whether the value of the test statistic is small enough to reject the null hypothesis.

  • The critical value is the value of the test statistic that corresponds to an area of \({\alpha}\) (the level of significance) in the lower tail of the sampling distribution of the test statistic.
  • In other words, the critical value is the largest value of the test statistic that will result in the rejection of the null hypothesis.

Hilltop Coffee Example

  • The sampling distribution for the test statistic \({z}\) is a standard normal distribution.
    • Therefore, the critical value is the value of the test statistic that corresponds to an area of \(\alpha = 0.01\) in the lower tail of a standard normal distribution.
    • Using the standard normal probability table, we find that \(P_{\left(z\right)} = 0.01\) for \(z_{\alpha = 0.01} = −2.33\)
    • Refer Get P(z) by pnorm() or z by qnorm()
    • Thus, if the sample results in a value of the test statistic that is less than or equal to −2.33, the corresponding p-value will be less than or equal to .01; in this case, we should reject the null hypothesis.
  • Compare test statistic with z-value
    • Because \((z = -2.67) < (z_{\alpha = 0.01} = −2.33)\), we can reject \({H_0}\)
    • We can conclude that Hilltop is underfilling the cans.

Rejection Rule: Reject \({H_0}\) if \(z \leq z_{\alpha}\)
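The critical value comes from `qnorm()`; again the numbers mirror the Hilltop example:

```r
# #Hilltop Coffee: critical value approach, lower tail
alpha  <- 0.01
z_crit <- qnorm(alpha)                  # lower tail critical value
round(z_crit, 2)                        # -2.33

z <- (2.92 - 3) / (0.18 / sqrt(36))     # test statistic from the sample
z <= z_crit                             # TRUE -> reject H0
```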

24.6.3.3 Summary

The p-value approach to hypothesis testing and the critical value approach will always lead to the same rejection decision; that is, whenever the p-value is less than or equal to \({\alpha}\), the value of the test statistic will be less than or equal to the critical value.

  • The advantage of the p-value approach is that the p-value tells us how significant the results are (the observed level of significance).
    • If we use the critical value approach, we only know that the results are significant at the stated level of significance.

For an upper tail test, the test statistic \({z}\) is computed in the same way. However, the p-value is now the probability of obtaining a value for the test statistic as large as or larger than that provided by the sample. Thus, to compute the p-value for the upper tail test in the \({\sigma}\) known case, we use the standard normal distribution to find the probability that \({z}\) is greater than or equal to the value of the test statistic. Using the critical value approach, we reject the null hypothesis if the value of the test statistic is greater than or equal to the critical value \(z_{\alpha}\); in other words, we reject \({H_0}\) if \(z \geq z_{\alpha}\).

24.6.3.4 Acceptance and Rejection Region

23.1 Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)

Definition 24.19 An acceptance region (confidence interval) is a set of values of the test statistic for which the null hypothesis is accepted, i.e. if the observed test statistic is in the confidence interval then we accept the null hypothesis and reject the alternative hypothesis.

\[Z = \frac {{\overline{x}} - {\mu}}{{\sigma}/{\sqrt{n}}} \quad \iff {\mu} = {\overline{x}} - Z \frac{{\sigma}}{\sqrt{n}} \quad \to {\mu} = {\overline{x}} \pm Z \frac{{\sigma}}{\sqrt{n}} \quad \to {\mu} \approx {\overline{x}} \pm Z \frac{{s}}{\sqrt{n}} \tag{24.2}\]

Definition 24.20 The margin of error tells how far the original population means might be from the sample mean. It is given by \(Z\frac{{\sigma}}{\sqrt{n}}\)
Definition 24.21 A rejection region (critical region), is a set of values for the test statistic for which the null hypothesis is rejected. i.e. if the observed test statistic is in the critical region then we reject the null hypothesis and accept the alternative hypothesis.

24.6.4 Two-tailed Test

24.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

Definition 24.22 Two-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution.

Ex: Golf Company, mean driving distance is 295 yards i.e. \(({\mu}_0 = 295)\)

  • \(H_0 : \mu = 295 \iff H_a: \mu \neq 295\)
  • The quality control team selected \(\alpha = 0.05\) as the level of significance for the test.
  • From previous tests, assume known \({\sigma} = 12\)
  • For a sample size \(n = 50\)
    • Standard Error of \({\overline{x}}\) is \({\sigma}_{\overline{x}} = \frac{{\sigma}}{\sqrt{n}} = \frac{12}{\sqrt{50}} = 1.7\)
    • Central Limit Theorem, allows us to conclude that the sampling distribution of \({\overline{x}}\) can be approximated by a normal distribution.
  • Suppose for the sample, \(\overline{x} = 297.6\)

“ForLater” - This part needs to be moved to the Next Chapter.

  • p-value approach
    • For a two-tailed test, the p-value is the probability of obtaining a value for the test statistic as unlikely as or more unlikely than that provided by the sample.
    • Refer equation (24.1), \(z = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}} = \frac{297.6 - 295}{12/\sqrt{50}} = 1.53\)
    • Now to compute the p-value we must find the probability of obtaining a value for the test statistic at least as unlikely as \(z = 1.53\).
      • Clearly values of \(z \geq 1.53\) are at least as unlikely.
      • But, because this is a two-tailed test, values of \(z \leq −1.53\) are also at least as unlikely as the value of the test statistic provided by the sample.
    • Refer Get P(z) by pnorm() or z by qnorm(), to get the p-value
      • \(P_{\left(z\right)} = P_{\left(z \leq -1.53\right)} + P_{\left(z \geq 1.53\right)}\)
      • \(P_{\left(z\right)} = 2 \times P_{\left(z \geq 1.53\right)}\), Because the normal curve is symmetric
      • \(P_{\left(z\right)} = 2 \times 0.0630 = 0.1260\)
    • Compare p-value with Level of significance \(\alpha = 0.05\)
      • We do not reject \({H_0}\) because the \((\text{p-value}= 0.1260) > (\alpha = 0.05)\)
      • Because the null hypothesis is not rejected, no action will be taken.
  • critical value approach
    • The critical values for the test will occur in both the lower and upper tails of the standard normal distribution.
    • With a level of significance of \(\alpha = 0.05\), the area in each tail corresponding to the critical values is \(\alpha/2 = 0.025\).
    • Refer Get P(z) by pnorm() or z by qnorm()
      • Using the standard normal probability table, we find that \(P_{\left(z\right)} = 0.025\) for \(-z_{\alpha/2 = 0.025} = −1.96\) and \(z_{\alpha/2 = 0.025} = 1.96\)
    • Compare test statistic with z-value
      • Because \(-z_{\alpha/2} = -1.96 < (z = 1.53) < 1.96 = z_{\alpha/2}\), the test statistic does not fall in either rejection region, and we cannot reject \({H_0}\)

Rejection Rule: Reject \({H_0}\) if \(z \leq -z_{\alpha/2}\) or \(z \geq z_{\alpha/2}\)
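Both approaches for the golf example, in R (the two-tailed p-value doubles the upper tail area by symmetry):

```r
# #Golf example: two-tailed test, sigma known
mu0 <- 295; sigma <- 12; n <- 50; xbar <- 297.6; alpha <- 0.05

z <- (xbar - mu0) / (sigma / sqrt(n))   # test statistic
round(z, 2)                             # 1.53
p_value <- 2 * pnorm(-abs(z))           # both tails, by symmetry
round(p_value, 4)                       # 0.1255 (the text's 0.1260 rounds z to 1.53 first)
p_value <= alpha                        # FALSE -> do not reject H0

z_crit <- qnorm(1 - alpha / 2)          # about 1.96
abs(z) >= z_crit                        # FALSE -> do not reject H0
```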

(Online, might be wrong) Ex: Assume that for a Population with mean \({\mu}\) unknown and standard deviation \({\sigma} = 15\), if we take a sample \({n = 100}\) its sample mean is \({\overline{x}} = 42\).

Assume \({\alpha} = 0.05\) and if we are conducting a Two Tail Test, \(Z_{\alpha/2 = 0.05/2} = 1.960\)

  • If we take a different sample of same size or a sample of different size, the sample mean calculated for those would be different.
  • So, our sample mean \({\overline{x}}\) might not be the true population mean \({\mu}\)
  • Thus, a range is inferred using the sample size, the sample mean, and the population standard deviation, and it is assumed that the true population mean falls within this interval. This interval is called a confidence interval.
  • Confidence interval is calculated using critical limit \({z}\), and thus are calculated for specific significance level \({\alpha}\)
  • Margin of Error \(= Z\frac{{\sigma}}{\sqrt{n}} = 1.96 \times 15 /\sqrt{100} = 2.94\)

As shown in the equation (24.2), our interval range is \(\mu = \overline{X} \pm 2.94 = 42 \pm 2.94 \rightarrow \mu \in (39.06, 44.94)\)

We are 95% confident that the population mean will be between 39.06 and 44.94.

Note that a 95% confidence interval does not mean there is a 95% chance that the true value being estimated is in the calculated interval. Rather, given a population, there is a 95% chance that choosing a random sample from this population results in a confidence interval which contains the true value being estimated.
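The margin of error and interval above, computed in R:

```r
# #95% confidence interval with known sigma
xbar <- 42; sigma <- 15; n <- 100; alpha <- 0.05

z_crit <- qnorm(1 - alpha / 2)          # about 1.96
moe    <- z_crit * sigma / sqrt(n)      # margin of error
round(moe, 2)                           # 2.94
round(xbar + c(-1, 1) * moe, 2)         # 39.06 44.94
```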

24.7 Steps of Hypothesis Testing

Common Steps

  1. Develop the null and alternative hypotheses.
  2. Specify the level of significance.
  3. Collect the sample data and compute the value of the test statistic.

p-Value Approach Steps

  1. Use the value of the test statistic to compute the p-value.
  2. Reject \({H_0}\) if the p-value \(\leq {\alpha}\).
  3. Interpret the statistical conclusion in the context of the application.
Definition 24.23 p-value Approach: Form Hypothesis | Specify \({\alpha}\) | Calculate test statistic | Calculate p-value | Compare p-value with \({\alpha}\) | Interpret

Critical Value Approach

  1. Use the level of significance to determine the critical value and the rejection rule.
  2. Use the value of the test statistic and the rejection rule to determine whether to reject \({H_0}\).
  3. Interpret the statistical conclusion in the context of the application.

24.8 Relationship Between Interval Estimation and Hypothesis Testing

Refer equation (23.1), For the \({\sigma}\) known case, the \({(1 - \alpha)}\%\) confidence interval estimate of a population mean is given by

\[\overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\]

We know that \(100 {(1 - \alpha)}\%\) of the confidence intervals generated will contain the population mean and \(100 {\alpha}\%\) of the confidence intervals generated will not contain the population mean.

Thus, if we reject \({H_0}\) whenever the confidence interval does not contain \({\mu}_0\), we will be rejecting the null hypothesis when it is true \((\mu = {\mu}_0)\) with probability \({\alpha}\).

The level of significance is the probability of rejecting the null hypothesis when it is true. So constructing a \(100 {(1 - \alpha)}\%\) confidence interval and rejecting \({H_0}\) whenever the interval does not contain \({\mu}_0\) is equivalent to conducting a two-tailed hypothesis test with \({\alpha}\) as the level of significance.

Ex: Golf company

  • For \({\alpha} = 0.05\), 95% confidence interval estimate of the population mean is
    • \({\overline{x}} \pm z_{0.025} \frac{{\sigma}}{\sqrt{n}} = 297.6 \pm 1.96 \frac{12}{\sqrt{50}} = 297.6 \pm 3.3\)
    • Interval: \([294.3, 300.9]\)
    • We can conclude with 95% confidence that the mean distance for the population of golf balls is between 294.3 and 300.9 yards.
    • Because the hypothesized value for the population mean, \({\mu}_0 = 295\), is in this interval, the hypothesis testing conclusion is that the null hypothesis, \({H_0: {\mu} = 295}\), cannot be rejected.
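A quick R check of the golf-ball interval (numbers from the example above):

```r
# #95% confidence interval: 297.6 +- z * 12/sqrt(50)
round(297.6 + c(-1, 1) * qnorm(p = 0.025, lower.tail = FALSE) * 12 / sqrt(50), 1)
## [1] 294.3 300.9
```

Because \({\mu}_0 = 295\) lies inside this interval, the equivalence with the two-tailed test is visible directly from the computed limits.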

“ForLater” - Exercises

24.9 Unknown SD

Definition 24.24 If \({\sigma}\) is unknown, the sampling distribution of the test statistic follows the t distribution with \((n − 1)\) degrees of freedom. Refer equation (24.3) \(\to t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}}\)

\[t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}} \tag{24.3}\]

One-Tailed Test

  • Ex: Heathrow Airport, testing for mean rating 7 i.e. \({\mu}_0 = 7\)
    • \({H_0}: {\mu} \leq 7 \iff {H_a}: {\mu} > 7\)
    • Sample: \({\overline{x}} = 7.25, s = 1.052, n = 60\)
    • \({\alpha} = 0.05\)
    • Refer equation (24.3), \(t = \frac{\overline{x} - {\mu}_0}{s/\sqrt{n}} = \frac{7.25 - 7}{1.052/\sqrt{60}} = 1.84\)
    • \(\text{DOF} = n-1 = 60 -1 = 59\)
    • Refer For P(t), find t by qt() and This is a Right Tail Test
      • \({P_{\left(t \geq 1.84\right)}} = 0.0354\) i.e. between 0.05 and 0.025
    • Comparison
      • \((P_{\left(t \geq 1.84\right)} = 0.0354) < ({\alpha} = 0.05)\)
      • Thus, we can reject the \({H_0}\) and can accept the \({H_a}\)

Critical Value Approach - \((\text{DOF = 59}), \, t_{{\alpha} = 0.05} = 1.671\) - Because \((t = 1.84) > (t_{{\alpha} = 0.05} = 1.671)\), Reject \({H_0}\)

# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
#
# #p-value approach: Find Cumulative Probability P corresponding to the given t-value & DOF=59
pt(q = 1.84, df = 59, lower.tail = FALSE)
## [1] 0.03539999
#
# #Critical Value: t-value for which Area under the curve towards Right is alpha=0.05 & DOF=59
qt(p = 0.05, df = 59, lower.tail = FALSE)
## [1] 1.671093

Two-Tailed Test

  • Ex: Holiday Toys, testing for sale of 40 units, i.e. \({\mu}_0 = 40\)
    • \({H_0}: {\mu} = 40 \iff {H_a} \neq 40\)
    • Sample: \({\overline{x}} = 37.4, s = 11.79, n = 25\)
    • \({\alpha} = 0.05\)
    • Refer equation (24.3), \(t = \frac{\overline{x} - {\mu}_0}{s/\sqrt{n}} = \frac{37.4 - 40}{11.79/\sqrt{25}} = -1.10\)
    • \(\text{DOF} = n-1 = 25 -1 = 24\)
    • Because we have a two-tailed test, the p-value is two times the area under the curve of the t distribution for \(t \leq -1.10\)
      • \(P_{\left(t\right)} = P_{\left(t \leq -1.10\right)} + P_{\left(t \geq 1.10\right)}\)
      • \(P_{\left(t\right)} = 2 \times P_{\left(t \leq -1.10\right)}\), because the t distribution is symmetric
      • \(P_{\left(t\right)} = 2 \times 0.1411 = 0.2822\) i.e. between \(2 \times (0.10, 0.20) = (0.20, 0.40)\)
    • Comparison
      • \((P_{\left(t\right)} = 0.2822) > ({\alpha} = 0.05)\)
      • Thus, we cannot reject the \({H_0}\)

Critical Value Approach - \((\text{DOF = 24})\) - We find that \(P_{\left(t\right)} = 0.025\) for \(-t_{\alpha/2 = 0.025} = -2.064\) and \(t_{\alpha/2 = 0.025} = 2.064\) - Compare test statistic with t-value - Because \((t = -1.10)\) is NOT lower than \((-t_{\alpha/2 = 0.025} = -2.064)\), we cannot reject \({H_0}\)
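The two-tailed p-value and the critical value for this test (DOF = 24) can be checked in R:

```r
# #p-value approach: double the lower-tail area for t = -1.10 with DOF = 24
round(2 * pt(q = -1.10, df = 24), 3)
## [1] 0.282
# #Critical Value: t-value with Upper Tail area alpha/2 = 0.025 & DOF = 24
round(qt(p = 0.025, df = 24, lower.tail = FALSE), 3)
## [1] 2.064
```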

24.10 Population Proportions

24.10.1 Hypothesis

Using \({p}_0\) to denote the hypothesized value for the population proportion, the three forms for a hypothesis test about a population proportion \({p}\) are :

Definition 24.25 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {p} \geq {p}_0 \iff {H_a}: {p} < {p}_0\)
Definition 24.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)
Definition 24.27 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {p} = {p}_0 \iff {H_a}: {p} \neq {p}_0\)

Hypothesis tests about a population proportion are based on the difference between the sample proportion \({\overline{p}}\) and the hypothesized population proportion \({p}_0\)

The sampling distribution of \({\overline{p}}\), the point estimator of the population parameter \({p}\), is the basis for developing the test statistic.

When the null hypothesis is true as an equality, the expected value of \({\overline{p}}\) equals the hypothesized value \({p}_0\) i.e. \(E_{(\overline{p})} = {p}_0\)

The standard error of \({\overline{p}}\) is given in equation (24.4)

\[{\sigma}_{\overline{p}} = \sqrt{\frac{{p}_0 (1 - {p}_0)}{n}} \tag{24.4}\]

If \(n{p}_0 \geq 5\) and \(n(1 − {p}_0) \geq 5\), the sampling distribution of \({\overline{p}}\) can be approximated by a normal distribution. Under these conditions, which usually apply in practice, the quantity \({z}\) as given in equation (24.5) has a standard normal probability distribution.

Test Statistic for Hypothesis Tests about a Population Proportion :

\[z = \frac{{\overline{p}} - {p}_0}{{\sigma}_{\overline{p}}} = \frac{{\overline{p}} - {p}_0}{\sqrt{\frac{{p}_0 (1 - {p}_0)}{n}}} \tag{24.5}\]

Example: Pine Creek: Determine whether the proportion of women golfers increased from \(p_0 = 0.20\)

24.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)

  • Count of Success \(({x})\) is Number of Women
  • \(\{n = 400, x = 100\} \to {\overline{p}} = {x}/{n} = 0.25\)
  • (24.5) \(z = \frac{{\overline{p}} - {p}_0}{{\sigma}_{\overline{p}}} = \frac{{\overline{p}} - {p}_0}{\sqrt{\frac{{p}_0 (1 - {p}_0)}{n}}}\)
    • \(z = \frac{0.25 - 0.20}{\sqrt{\frac{0.20 (1 - 0.20)}{400}}} = 2.50\)
      • {0.25 - 0.20}/{sqrt(0.20 * {1 - 0.20} / 400)} \(\#\mathcal{R}\)
  • \({}^U\!P_{(z = 2.50)} = 0.0062\)
    • pnorm(q = 2.50, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(z)} < {\alpha} \to {H_0}\) is rejected i.e. the proportions are different
    • We can conclude that the proportion of women players has increased.
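As a cross-check (a sketch, not from the text): base R's prop.test() with correct = FALSE reproduces this normal-approximation z-test; its X-squared statistic equals \(z^2\).

```r
# #Pine Creek: test p > 0.20 with x = 100 successes out of n = 400
pp <- prop.test(x = 100, n = 400, p = 0.20, alternative = "greater", correct = FALSE)
round(unname(sqrt(pp$statistic)), 2)   #equals z
## [1] 2.5
round(pp$p.value, 4)
## [1] 0.0062
```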

24.11 Hypothesis Testing and Decision Making

If the purpose of a hypothesis test is to make a decision when \({H_0}\) is true and a different decision when \({H_a}\) is true, the decision maker may want to, and in some cases be forced to, take action with both the conclusion do not reject \({H_0}\) and the conclusion reject \({H_0}\).

If this situation occurs, statisticians generally recommend controlling the probability of making a Type II error. With the probabilities of both the Type I and Type II error controlled, the conclusion from the hypothesis test is either to accept \({H_0}\) or reject \({H_0}\). In the first case, \({H_0}\) is concluded to be true, while in the second case, \({H_a}\) is concluded true. Thus, a decision and appropriate action can be taken when either conclusion is reached.

“ForLater” - Calculate \({\beta}\)

When the true population mean \({\mu}\) is close to the null hypothesis value of \({\mu} = 120\), the probability is high that we will make a Type II error. However, when the true population mean \({\mu}\) is far below the null hypothesis value of \({\mu} = 120\), the probability is low that we will make a Type II error.

Definition 24.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 − \beta\).
Definition 24.29 Power Curve is a graph of the probability of rejecting \({H_0}\) for all possible values of the population parameter \({\mu}\) not satisfying the null hypothesis. It provides the probability of correctly rejecting the null hypothesis.

Note that the power curve extends over the values of \({\mu}\) for which the null hypothesis is false. The height of the power curve at any value of \({\mu}\) indicates the probability of correctly rejecting \({H_0}\) when \({H_0}\) is false.

24.12 Summary

We can make 3 observations about the relationship among \({\alpha}, \beta, n (\text{sample size})\).

  1. Once two of the three values are known, the other can be computed.
  2. For a given level of significance \({\alpha}\), increasing the sample size will reduce \({\beta}\).
  3. For a given sample size, decreasing \({\alpha}\) will increase \({\beta}\), whereas increasing \({\alpha}\) will decrease \({\beta}\).

Validation


25 Two Populations

25.1 Overview

25.2 Introduction

This chapter covers how interval estimates and hypothesis tests can be developed for situations involving two populations when the difference between the two population means or the two population proportions is of prime importance.

Example

  • To develop an interval estimate of the difference between the mean starting salary for a population of men and the mean starting salary for a population of women.
  • To conduct a hypothesis test to determine whether any difference is present between the proportion of defective parts in a population of parts produced by supplier A and the proportion of defective parts in a population of parts produced by supplier B.

25.3 Known SD: Two Population Means

Inferences About the Difference Between Two Population Means

Definition 25.1 Let \({\mathcal{N}}_{({\mu}_1, \, {\sigma}_1)}\) and \({\mathcal{N}}_{({\mu}_2, \, {\sigma}_2)}\) be the two populations. To make an inference about the difference between the means \(({\mu}_1 - {\mu}_2)\), we select a simple random sample of \({n}_1\) units from population 1 and a second simple random sample of \({n}_2\) units from population 2. The two samples, taken separately and independently, are referred to as independent simple random samples.

25.3.1 Interval Estimation

Interval Estimation of \(({\mu}_1 - {\mu}_2)\)

The point estimator of the difference between the two population means \(({\mu}_1 - {\mu}_2)\) is the difference between the two sample means \(({\overline{x}}_1 - {\overline{x}}_2)\). Here \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )}\) denotes this point estimate of the difference between the population means. It is given by equation (25.1)

Point Estimate :

\[E_{( {\overline{x}}_1 - {\overline{x}}_2 )} = {\overline{x}}_1 - {\overline{x}}_2 \tag{25.1}\]

As with other point estimators, the point estimator \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )}\) has a standard error \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}\), that describes the variation in the sampling distribution of the estimator. It is the standard deviation of the sampling distribution of \(({\overline{x}}_1 - {\overline{x}}_2)\). Refer equation (25.2)

Standard Error of \(({\overline{x}}_1 - {\overline{x}}_2)\) :

\[{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} \tag{25.2}\]

23.1 Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)

In the case of estimation of the difference between two population means, an interval estimate will take the following form: \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )} \, \pm \text{MOE}_{{\gamma}}\). Refer equation (25.3) and (25.4)

Margin of Error (\(\text{MOE}_{{\gamma}}\)) :

\[\text{MOE}_{{\gamma}} = {z}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = {z}_{\frac{{\alpha}}{2}}\sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} \tag{25.3}\]

\(\text{Interval Estimate}_{\gamma}\) :

\[\text{Interval Estimate}_{\gamma} = ({\overline{x}}_1 - {\overline{x}}_2) \pm {z}_{\frac{{\alpha}}{2}} \sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} \tag{25.4}\]

Example: Greystone: Difference between the mean

25.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

  • (1: Inner ) \({n}_1 = 36, {\overline{x}}_1 = 40, {\sigma}_1 = 9\)
  • (2: Suburb) \({n}_2 = 49, {\overline{x}}_2 = 35, {\sigma}_2 = 10\)
  • For \({\gamma = 0.95} \iff{\alpha} = 0.05 \to {z_{{\alpha}/2}} = {z_{0.025}} = 1.96\)
    • qnorm(p = 0.025, lower.tail = FALSE) \(\#\mathcal{R}\)
  • (25.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} = \sqrt{\frac{{9}^2}{36} + \frac{{10}^2}{49}} = 2.0714\)
    • sqrt(9^2/36 + 10^2/49) \(\#\mathcal{R}\)
  • (25.3) \(\text{MOE}_{\gamma} = {z}_{\frac{{\alpha}}{2}} {\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = 1.96 * 2.071 = 4.06\)
  • (25.4) \(\text{Interval Estimate}_{\gamma} = ({\overline{x}}_1 - {\overline{x}}_2) \pm \text{MOE}_{\gamma} = (40 - 35) \pm 4.06 = 5 \pm 4.06\)
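Combining the steps in R (numbers from the Greystone example):

```r
# #95% interval estimate for (mu1 - mu2)
pe  <- 40 - 35
se  <- sqrt(9^2/36 + 10^2/49)
moe <- qnorm(p = 0.025, lower.tail = FALSE) * se
round(pe + c(-1, 1) * moe, 2)
## [1] 0.94 9.06
```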

25.3.2 Hypothesis Tests

Using \({D_0}\) to denote the hypothesized difference between \({\mu}_1\) and \({\mu}_2\), the three forms for a hypothesis test are as follows:

Definition 25.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)
Definition 25.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
Definition 25.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

The test statistic for the difference between two population means when \({\sigma}_1\) and \({\sigma}_2\) are known is given in equation (25.5)

Test Statistic for Hypothesis Tests :

\[z = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{\sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}}} \tag{25.5}\]

Example

Example: Evaluate differences in education quality between two training centers

25.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

  • (1: A) \({n}_1 = 30, {\overline{x}}_1 = 82, {\sigma}_1 = 10\)
  • (2: B) \({n}_2 = 40, {\overline{x}}_2 = 78, {\sigma}_2 = 10\)
  • (25.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{10}^2}{30} + \frac{{10}^2}{40}} = 2.4152\)
    • sqrt(10^2/30 + 10^2/40) \(\#\mathcal{R}\)
  • (25.5) \(z = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}} = \frac{(82 - 78) - 0}{2.415} = 1.66\)
  • Calculate \({}^2\!P_{(z)}\)
    • Because z is in upper tail, we get upper tail area and because it is Two-Tail Test, we double it
    • \({}^2\!P_{(z = 1.66)} = 2 * {}^U\!P_{(z = 1.66)} = 2 * 0.0485 = 0.0970\)
      • 2 * pnorm(q = 1.66, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^2\!P_{(z)} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude the training centers differ in quality.

Comparison

# #Get P(z) for z = 1.66 (Two-Tail)
#
# #Get the default (lower), subtract from 1, Double if Two-Tail
ii <- 2 * {1 - pnorm(q = 1.66)}
jj <- 2 * {1 - pnorm(q = 1.66, lower.tail = TRUE)}
#
# #Use the symmetry i.e. 'minus z' value, Double if Two-Tail
kk <- 2 * pnorm(q = -1.66)
#
# #Use the actual Upper Tail Option, Double if Two-Tail
ll <- 2 * pnorm(q = 1.66, lower.tail = FALSE)
#
stopifnot(all(identical(round(ii, 7), round(jj, 7)), identical(round(ii, 7), round(kk, 7)),
              identical(round(ii, 7), round(ll, 7))))
ll
## [1] 0.09691445

Shapiro-Wilk test

Definition 25.5 The Shapiro-Wilk test is a test of normality. It tests the null hypothesis that a sample came from a normally distributed population. \(P_{\text{shapiro}} > ({\alpha} = 0.05) \to \text{Data is Normal}\). Avoid using sample with more than 5000 observations.

The null-hypothesis of this test is that the population is normally distributed. Thus, if the p value is less than the chosen alpha level, then the null hypothesis is rejected and there is evidence that the data being tested is not distributed normally.

On the other hand, if the p value is greater than the chosen alpha level, then the null hypothesis (that the data came from a normally distributed population) cannot be rejected (e.g., for an alpha level of .05, a data set with a p value of less than .05 rejects the null hypothesis that the data are from a normally distributed population).

Shapiro–Wilk test is known not to work well in samples with many identical values.
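A minimal illustration (simulated data; the seed and sample size are ours): shapiro.test() returns an htest object whose p-value is compared with \({\alpha}\).

```r
# #Shapiro-Wilk on clearly non-normal (exponential) data; n must be in [3, 5000]
set.seed(123)
res <- shapiro.test(rexp(n = 200))
res$p.value < 0.05   #small p-value -> reject normality
## [1] TRUE
```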

ERROR 25.1 Error in shapiro.test(...) : sample size must be between 3 and 5000
  • Ideally we should not test for normality of a sample with more than 5000 observations.
    • However, we can randomly select 5000 observations and test them by Shapiro.
    • We can use Anderson–Darling test or Kolmogorov-Smirnov test (Nonparametric) (KS is weaker than AD)
    • Anderson–Darling test is not quite as good as the Shapiro–Wilk test, but is better than other tests.
  • (SO) Normality Testing
    • Normality Tests like Shapiro or Anderson Darling are NULL Hypothesis Tests against the assumption of normality.
      • When the sample size is small, even big departures from normality are not detected
      • When your sample size is large, even the smallest deviation from normality will lead to a rejected null.
        • In Shapiro-Wilk test, with more data, the chances of the null hypothesis being rejected becomes larger even though for practical purposes the data is more than normal enough.
    • Do not worry much about normality. The CLT takes over quickly and if you have all but the smallest sample sizes and an even remotely reasonable looking histogram you are fine.
    • Worry about unequal variances (heteroskedasticity). Heteroscedasticity-consistent standard errors are checked by HCCM tests. A scale location plot will give some idea of whether this is broken, but not always.
    • Also, there is no a priori reason to assume equal variances in most cases.
    • Further Outliers: A cooks distance of > 1 is reasonable cause for concern.

25.4 Exercises

01

25.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

  • (1:) \({n}_1 = 50, {\overline{x}}_1 = 13.6, {\sigma}_1 = 2.2\)
  • (2:) \({n}_2 = 35, {\overline{x}}_2 = 11.6, {\sigma}_2 = 3.0\)
  • What is the point estimate of the difference between the two population means
    • (25.1) \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )} = 13.6 - 11.6 = 2\)
  • (25.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{2.2}^2}{50} + \frac{{3}^2}{35}} = 0.5949\)
    • sqrt(2.2^2/50 + 3^2/35) \(\#\mathcal{R}\)
  • Provide a 90% confidence interval for the difference between the two population means
    • \({\gamma = 0.90} \iff{\alpha} = 0.10 \to {z_{{\alpha}/2}} = {z_{0.05}} = 1.6448\)
      • qnorm(p = 0.05, lower.tail = FALSE) \(\#\mathcal{R}\)
    • (25.3) \(\text{MOE}_{\gamma =0.90} = 1.6448 * 0.5949 = 0.9785\)
    • (25.4) \(\text{Interval Estimate}_{\gamma} = 2 \pm 0.9785\)
  • Provide a 95% confidence interval for the difference between the two population means
    • \({\gamma = 0.95} \iff{\alpha} = 0.05 \to {z_{{\alpha}/2}} = {z_{0.025}} = 1.96\)
    • \(\text{MOE}_{\gamma =0.95} = 1.96 * 0.5949 = 1.166\)
    • \(\text{Interval Estimate}_{\gamma} = 2 \pm 1.166\)
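The two intervals of Exercise 01 in R:

```r
# #Exercise 01: point estimate, SE, and 90%/95% interval estimates
pe <- 13.6 - 11.6
se <- sqrt(2.2^2/50 + 3^2/35)
round(pe + c(-1, 1) * qnorm(p = 0.05,  lower.tail = FALSE) * se, 2)  #90%
## [1] 1.02 2.98
round(pe + c(-1, 1) * qnorm(p = 0.025, lower.tail = FALSE) * se, 2)  #95%
## [1] 0.83 3.17
```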

02

25.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

  • (1:) \({n}_1 = 40, {\overline{x}}_1 = 25.2, {\sigma}_1 = 5.2\)
  • (2:) \({n}_2 = 50, {\overline{x}}_2 = 22.8, {\sigma}_2 = 6.0\)
  • (25.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{5.2}^2}{40} + \frac{{6}^2}{50}} = 1.1815\)
    • sqrt(5.2^2/40 + 6^2/50) \(\#\mathcal{R}\)
  • What is the value of the test statistic
    • (25.5) \(z = \frac{(25.2 - 22.8) - 0}{1.18} = 2.03\)
  • What is the p-value
    • \({}^U\!P_{z = 2.03} = 0.0212\)
      • pnorm(q = 2.03, lower.tail = FALSE) \(\#\mathcal{R}\)
  • With \({\alpha} = 0.05\), what is your hypothesis testing conclusion
    • \({}^U\!P_{z} < {\alpha} \to {H_0}\) Rejected

04 Conde

25.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

  • (1: small) \({n}_1 = 37, {\overline{x}}_1 = 85.36, {\sigma}_1 = 4.55\)
  • (2: large) \({n}_2 = 44, {\overline{x}}_2 = 81.40, {\sigma}_2 = 3.97\)
  • (25.1) \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )} = 85.36 - 81.40 = 3.96\)
  • (25.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{4.55}^2}{37} + \frac{{3.97}^2}{44}} = 0.958\)
    • sqrt(4.55^2/37 + 3.97^2/44) \(\#\mathcal{R}\)
  • \({\gamma = 0.95} \iff{\alpha} = 0.05 \to {z_{{\alpha}/2}} = {z_{0.025}} = 1.96\)
    • qnorm(p = 0.025, lower.tail = FALSE) \(\#\mathcal{R}\)
  • (25.3), \(\text{MOE}_{\gamma =0.95} = 1.96 * 0.958 = 1.87768\)
  • (25.4), \(\text{Interval Estimate}_{\gamma} = 3.96 \pm 1.88\)
  • Test Statistic z
    • (25.5) \(z = \frac{(85.36 - 81.40 ) - 0}{0.958} = 4.13369\)
  • p-value
    • \({}^2\!P_{(z = 4.13369)} = 2 * {}^U\!P_{(z = 4.13369)} = 2 * 0.0000178 \approx 0\)
      • pnorm(q = 4.13369, lower.tail = FALSE) \(\#\mathcal{R}\)
    • \({}^2\!P_{z} < {\alpha} \to {H_0}\) Rejected

08 Rite Aid

25.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)

  • Will improving customer service result in higher stock prices
  • For each case: \({n} = 60, {\sigma} = 6, {\alpha} = 0.05\)
    • (25.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{6}^2}{60} + \frac{{6}^2}{60}} = 1.0954\)
      • sqrt(6^2/60 + 6^2/60) \(\#\mathcal{R}\)
  • Rite \(\{{\overline{x}}_1 = 73, {\overline{x}}_2 = 76\}\), Expedia \(\{{\overline{x}}_1 = 75, {\overline{x}}_2 = 77\}\), JC \(\{{\overline{x}}_1 = 77, {\overline{x}}_2 = 78\}\)
  • For Rite Aid, is the increase in the satisfaction score from year 1 to year 2 statistically significant
    • (25.5) \(z = \frac{(73 - 76) - 0}{1.0954} = -2.738613\)
    • \({}^L\!P_{(z = -2.738613)} = 0.003\)
      • pnorm(q = -2.738613, lower.tail = TRUE) \(\#\mathcal{R}\)
      • \({}^L\!P_{z} < {\alpha} \to {H_0}\) is rejected i.e. the increase is significant
      • Caution: Using a Two-Tail Test would give a different result. The type of test should be chosen carefully.

25.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

  • Can you conclude that the year 2 score for Rite Aid is above the national average of 75.7
    • Group 1 is the year 2 score 76; group 2 is the national average 75.7
    • (25.5) \(z = \frac{(76 - 75.7) - 0}{1.0954} = 0.2739\)
    • \({}^U\!P_{(z = 0.2739)} = 0.392\)
      • pnorm(q = 0.2739, lower.tail = FALSE) \(\#\mathcal{R}\)
      • \({}^U\!P_{z} > {\alpha} \to {H_0}\) cannot be rejected. Difference is NOT significant.
      • Cannot conclude that the year 2 score is above the national average

25.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)

  • For Expedia, is the increase from year 1 to year 2 statistically significant
    • (25.5) \(z = \frac{(75 - 77) - 0}{1.0954} = -1.826\)
    • \({}^L\!P_{(z = -1.826)} = 0.0339\)
      • pnorm(q = -1.826, lower.tail = TRUE) \(\#\mathcal{R}\)
      • \({}^L\!P_{z} < {\alpha} \to {H_0}\) is rejected i.e. the increase is significant
  • When conducting a hypothesis test with the values given for the standard deviation, sample size, and alpha, how large must the increase from year 1 to year 2 be for it to be statistically significant
    • \({\alpha} = 0.05 \iff {}^L\!P_{(z)} = 0.05 \to {z_{0.05}} = -1.6448\)
      • qnorm(p = 0.05, lower.tail = TRUE) \(\#\mathcal{R}\)
    • (25.3), \(\text{MOE}_{\gamma =0.95} = -1.6448 * 1.0954 = -1.8\)
    • An increase of at least 1.8 is needed for the result to be significant
    • For JC, because the increase is only 1 i.e. less than 1.8, it will not be significant

25.5 Unknown SD: Two Population Means

Use the sample standard deviations, \({s}_1\) and \({s}_2\), to estimate the unknown population standard deviations \(({\sigma}_1, {\sigma}_2)\).

When \(({\sigma}_1, {\sigma}_2)\) are estimated by \(({s}_1, {s}_2)\), the t distribution is used to make inferences about the difference between two population means.

Interval Estimation of \(({\mu}_1 - {\mu}_2)\)

Margin of Error (\(\text{MOE}_{{\gamma}}\)) : Refer (25.6) like (25.3)

\[\text{MOE}_{{\gamma}} = {t}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = {t}_{\frac{{\alpha}}{2}}\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}} \tag{25.6}\]

\(\text{Interval Estimate}_{\gamma}\) : Refer (25.7) like (25.4) using (25.2)

\[\text{Interval Estimate}_{\gamma} = ({\overline{x}}_1 - {\overline{x}}_2) \pm {t}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = ({\overline{x}}_1 - {\overline{x}}_2) \pm {t}_{\frac{{\alpha}}{2}}\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}} \tag{25.7}\]

Degrees of Freedom (DOF) : Refer (25.8)

\[\text{DOF} = \frac{ { \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right) }^2} {\frac{1}{n_1 - 1}{ \left( \frac{s_1^2}{n_1} \right) }^2 + \frac{1}{n_2 - 1}{ \left( \frac{s_2^2}{n_2} \right) }^2} \tag{25.8}\]
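Equation (25.8) can be wrapped in a small helper (the function name welch_df is ours, not from the text); the result is truncated with floor() to be conservative:

```r
# #Welch-Satterthwaite degrees of freedom (25.8)
welch_df <- function(s1, n1, s2, n2) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}
floor(welch_df(s1 = 150, n1 = 28, s2 = 125, n2 = 22))   #Clearwater example below
## [1] 47
```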

Example: Clearwater - To estimate the difference between the mean

25.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

  • (1: Cherry) \({n}_1 = 28, {\overline{x}}_1 = 1025, {s}_1 = 150\)
  • (2: Beech) \({n}_2 = 22, {\overline{x}}_2 = 910, {s}_2 = 125\)
  • (25.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{150}^2}{28} + \frac{{125}^2}{22}} = 38.9076\)
    • sqrt(150^2/28 + 125^2/22) \(\#\mathcal{R}\)
  • (25.8) \(\text{DOF} = 47\)
    • floor({150^2 / 28 + 125^2 / 22 }^2 / {{150^2 / 28}^2/{28-1} + {125^2 / 22}^2/{22-1}}) \(\#\mathcal{R}\)
  • \({\gamma = 0.95} \iff{\alpha} = 0.05 \to {{}^2\!t_{{\alpha}/2}} = {{}^2\!t_{0.025}} = 2.012\)
    • qt(p = 0.025, df = 47, lower.tail = FALSE) \(\#\mathcal{R}\)
  • (25.6) \(\text{MOE}_{\gamma =0.95} = 2.012 * 38.9076 = 78.3\)
  • (25.7) \(\text{Interval Estimate}_{\gamma} = (1025-910) \pm 78\)

Hypothesis Tests

Test Statistic for Hypothesis Tests : Refer (25.9) like (25.5) using (25.2)

\[t = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}} = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}}} \tag{25.9}\]

Caution: Use of \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}\) symbol is probably wrong here because it should not represent formula containing \({s}_1, {s}_2\). “ForLater”

ERROR 25.2 Error in t.test.formula() : grouping factor must have exactly 2 levels
  • For t.test(): Formula needs to be ‘value ~ key,’ not ‘key ~ value’

Example

Software: To show that the new software will provide a shorter mean time

25.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

  • (1: Old) \({n}_1 = 12, {\overline{x}}_1 = 325, {s}_1 = 40\)
  • (2: New) \({n}_2 = 12, {\overline{x}}_2 = 286, {s}_2 = 44\)
  • (25.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{40}^2}{12} + \frac{{44}^2}{12}} = 17.1659\)
    • sqrt(40^2/12 + 44^2/12) \(\#\mathcal{R}\)
  • (25.8) \(\text{DOF} = 21\)
    • floor({40^2 / 12 + 44^2 / 12 }^2 / {{40^2 / 12}^2/{12-1} + {44^2 / 12}^2/{12-1}}) \(\#\mathcal{R}\)
  • (25.9) \(t = \frac{(325 - 286) - 0}{17.1659} = 2.272\)
  • Calculate \({}^U\!P_{(t)}\)
    • \({}^U\!P_{(t = 2.272)} = 0.0168\)
      • pt(q = 2.272, df = 21, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(t)} < {\alpha} \to {H_0}\) is rejected i.e. the decrease is significant
    • It supports the conclusion that the new software provides a smaller population mean.

Code

CONVERT HERE to Tilde Based Option, to take advantage of Column Headers

# #Software 
xxSoftware <- tibble(Old = c(300, 280, 344, 385, 372, 360, 288, 321, 376, 290, 301, 283), 
                     New = c(274, 220, 308, 336, 198, 300, 315, 258, 318, 310, 332, 263))
aa <- xxSoftware
# Summary
aa %>% 
  pivot_longer(everything(), names_to = "key", values_to = "value") %>% 
  group_by(key) %>% 
  summarise(across(value, list(Count = length, Mean = mean, SD = sd), .names = "{.fn}"))
## # A tibble: 2 x 4
##   key   Count  Mean    SD
##   <chr> <int> <dbl> <dbl>
## 1 New      12   286  44.0
## 2 Old      12   325  40.0
bb <- aa %>% pivot_longer(everything(), names_to = "key", values_to = "value")
#
# #Welch Two Sample t-test
# #Alternative must be: "two.sided" (Default), "less", "greater"
# #Caution: value ~ key orders groups alphabetically (New, Old), so "greater" tests (New - Old) > 0
# #To test (Old - New) > 0 as in the manual analysis above, use the x/y form or alternative = "less"
bb_ha <- "greater"
#bb_testT <- t.test(x = bb$Old, y = bb$New, alternative = bb_ha)
bb_testT <- t.test(formula = value ~ key, data = bb, alternative = bb_ha)
bb_testT
## 
##  Welch Two Sample t-test
## 
## data:  value by key
## t = -2.2721, df = 21.803, p-value = 0.9833
## alternative hypothesis: true difference in means between group New and group Old is greater than 0
## 95 percent confidence interval:
##  -68.48569       Inf
## sample estimates:
## mean in group New mean in group Old 
##               286               325
#
cat(paste0("t is the t-test statistic value (t = ", round(bb_testT$statistic, 6), ")\n"))
## t is the t-test statistic value (t = -2.272127)
cat(paste0("df is the degrees of freedom (df = ", round(bb_testT$parameter, 1), ")\n"))
## df is the degrees of freedom (df = 21.8)
cat(paste0("p-value is the significance level of the t-test (p-value = ", 
           round(bb_testT$p.value, 6), ")\n"))
## p-value is the significance level of the t-test (p-value = 0.98335)
cat(paste0("conf.int is the confidence interval of the mean at 95% (conf.int = [",
           paste0(round(bb_testT$conf.int, 3), collapse = ", "), "])\n"))
## conf.int is the confidence interval of the mean at 95% (conf.int = [-68.486, Inf])
cat(paste0("sample estimates is the mean value of the samples. Mean: ", 
           paste0(round(bb_testT$estimate, 2), collapse = ", "), "\n"))
## sample estimates is the mean value of the samples. Mean: 286, 325
#
# #Compare p-value with alpha = 0.05
alpha <- 0.05
if(any(all(bb_ha == "two.sided", bb_testT$p.value >= alpha / 2), 
       all(bb_ha != "two.sided", bb_testT$p.value >= alpha))) {
  cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is greater than alpha (", alpha, 
      "). We failed to reject H0. We cannot conclude that the populations are different.\n")) 
} else {
    cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is less than alpha (", alpha, 
      ").\nWe can reject the H0 with 95% confidence. The populations are different.\n"))
}
## p-value (0.98335) is greater than alpha (0.05). We failed to reject H0. We cannot conclude that the populations are different.

Pooled

Another approach used to make inferences about the difference between two population means when \({\sigma}_1\) and \({\sigma}_2\) are unknown is based on the assumption that the two population standard deviations are equal \(({\sigma}_1 = {\sigma}_2 = {\sigma})\). Under this assumption, the two sample standard deviations are combined to provide the pooled sample variance as given in equation (25.10).

\[{s}_p^2 = \frac{({n}_1 - 1){s}_1^2 + ({n}_2 - 1){s}_2^2}{{n}_1 + {n}_2 - 2} \tag{25.10}\]

The t test statistic becomes (25.11) with \(({n}_1 + {n}_2 - 2)\) degrees of freedom.

\[t = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{s}_p\sqrt{\frac{1}{{n}_1} + \frac{1}{{n}_2}}} \tag{25.11}\]

Then the computation of the p-value and the interpretation of the sample results are the same as earlier.

  • Caution: A difficulty with this procedure is that the assumption that the two population standard deviations are equal is usually difficult to verify.
    • Unequal population standard deviations are frequently encountered.
    • Using the pooled procedure may not provide satisfactory results, especially if the sample sizes \({n}_1\) and \({n}_2\) are quite different.
    • The original t procedure does not require this assumption. It is a more general procedure and is recommended for most applications.
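Equations (25.10) and (25.11) can be sketched directly from summary statistics. As an illustration, the values below are the Old/New software summaries from the earlier example (n = 12 each, means 325 and 286, SDs 40 and 44), with equality of variances assumed:

```r
# #Pooled sample variance (25.10) and pooled t statistic (25.11)
# #Summary statistics taken from the Old/New software example above
n1 <- 12; xbar1 <- 325; s1 <- 40
n2 <- 12; xbar2 <- 286; s2 <- 44
D0 <- 0
sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)      #(25.10)
tstat <- (xbar1 - xbar2 - D0) / (sqrt(sp2) * sqrt(1/n1 + 1/n2)) #(25.11)
c(sp2 = sp2, t = tstat, df = n1 + n2 - 2)
```

With equal sample sizes and similar standard deviations, the pooled t (about 2.27) is close to the Welch t computed earlier.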

Pooled Code

(External) Unpaired Two-Samples T-test

# #Pooled Sample Variance
# #Evaluate differences in education quality between two training centers
# #Generate Data for Two Centers
set.seed(3)
setA <- rnorm(n = 30, mean = 82, sd = 10)
setB <- rnorm(n = 40, mean = 78, sd = 10)
bb <- tibble(sets = c(rep("setA", length(setA)), rep("setB", length(setB))), values = c(setA, setB))
# Summary
bb %>% group_by(sets) %>% summarise(Count = n(), Mean = mean(values), SD = sd(values))
## # A tibble: 2 x 4
##   sets  Count  Mean    SD
##   <chr> <int> <dbl> <dbl>
## 1 setA     30  79.7  8.10
## 2 setB     40  78.8  9.43
#
# #Assumption 1: Are the two samples independent
# #YES
#
# #Assumption 2: Do the data from each of the 2 groups follow a normal distribution
# #Shapiro-Wilk normality test
isNormal_A <- with(bb, shapiro.test(values[sets == "setA"]))
isNormal_B <- with(bb, shapiro.test(values[sets == "setB"]))
#isNormal_A
#isNormal_A$statistic
# #p-value > 0.05 is needed for Normality
isNormal_A$p.value
## [1] 0.1032747
isNormal_B$p.value
## [1] 0.15595
#
# #Both have p-values greater than the significance level alpha = 0.05.
# #implying that the distributions of the data are not significantly different from normal
# #In other words, we can assume the normality.
#
bb %>% 
  group_by(sets) %>% 
  summarise(p = shapiro.test(values)$p.value)
## # A tibble: 2 x 2
##   sets      p
##   <chr> <dbl>
## 1 setA  0.103
## 2 setB  0.156
#
# #Assumption 3. Do the two populations have the same variances
# #We will use F-test to test for homogeneity in variances. (p-value > 0.05 is needed)
#
bb_testF <- var.test(values ~ sets, data = bb)
# bb_testF
bb_testF$p.value
## [1] 0.3990791
#
# #The p-value of F-test is greater than the significance level alpha = 0.05. 
# #In conclusion, there is no significant difference between the variances of the two sets of data.
# #Therefore, we can use the classic t-test which assumes equality of the two variances.
#
# #Compute unpaired two-samples t-test
#
# #Question : Is there any significant difference between the mean of two populations
#
bb_testT <- t.test(formula = values ~ sets, data = bb, var.equal = TRUE)
bb_testT
## 
##  Two Sample t-test
## 
## data:  values by sets
## t = 0.41947, df = 68, p-value = 0.6762
## alternative hypothesis: true difference in means between group setA and group setB is not equal to 0
## 95 percent confidence interval:
##  -3.383982  5.185373
## sample estimates:
## mean in group setA mean in group setB 
##           79.66783           78.76714
#
cat(paste0("t is the t-test statistic value (t = ", round(bb_testT$statistic, 6), ")\n"))
## t is the t-test statistic value (t = 0.419474)
cat(paste0("df (n1 + n2 - 2) is the degrees of freedom (df = ", bb_testT$parameter, ")\n"))
## df (n1 + n2 - 2) is the degrees of freedom (df = 68)
cat(paste0("p-value is the significance level of the t-test (p-value = ", 
           round(bb_testT$p.value, 6), ")\n"))
## p-value is the significance level of the t-test (p-value = 0.676192)
cat(paste0("conf.int is the confidence interval of the mean at 95% (conf.int = [",
           paste0(round(bb_testT$conf.int, 3), collapse = ", "), "])\n"))
## conf.int is the confidence interval of the mean at 95% (conf.int = [-3.384, 5.185])
cat(paste0("sample estimates is the mean value of the samples. Mean: ", 
           paste0(round(bb_testT$estimate, 2), collapse = ", "), "\n"))
## sample estimates is the mean value of the samples. Mean: 79.67, 78.77
#
# #Compare p-value with alpha = 0.05
alpha <- 0.05
if(bb_testT$p.value >= alpha) {
  cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is greater than alpha (", alpha, 
      "). We failed to reject H0. We cannot conclude that the populations are different.\n")) 
} else {
    cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is less than alpha (", alpha, 
      ").\nWe can reject the H0 with 95% confidence. The populations are different.\n"))
}
## p-value (0.676192) is greater than alpha (0.05). We failed to reject H0. We cannot conclude that the populations are different.

Columns Summary

bb <- xxSoftware
str(bb)
## tibble [12 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Old: num [1:12] 300 280 344 385 372 360 288 321 376 290 ...
##  $ New: num [1:12] 274 220 308 336 198 300 315 258 318 310 ...
#
# #Applying Multiple Functions with Summarise but Output as Cross-Table
# #(Original) Columns as Rows, Functions as Columns
# #NOTE: n() can be applied as a lambda function 
bb %>% 
  pivot_longer(everything(), names_to = "key", values_to = "value") %>% 
  group_by(key) %>% 
  summarise(across(value, list(N = ~n(), Count = length, Mean = mean, SD = sd), .names = "{.fn}"))
## # A tibble: 2 x 5
##   key       N Count  Mean    SD
##   <chr> <int> <int> <dbl> <dbl>
## 1 New      12    12   286  44.0
## 2 Old      12    12   325  40.0

pivot_longer()

str(bb)
## tibble [12 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Old: num [1:12] 300 280 344 385 372 360 288 321 376 290 ...
##  $ New: num [1:12] 274 220 308 336 198 300 315 258 318 310 ...
#
# #gather() is deprecated. Here is for reference.
# #Longer Tibble is filled with All Values of Col A, then All Values of Col B and so on
ii <- gather(bb)
jj <- bb %>% gather("key", "value")
kk <- bb %>% gather("key", "value", everything()) 
#
# #pivot_longer()
# #Longer Tibble is filled with First Row of All Columns, then 2nd Row of All Columns and so on
ll <- bb %>% pivot_longer(everything(), names_to = "key", values_to = "value") %>% arrange(desc(key))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll)))

Multiple Functions

str(bb)
## tibble [12 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Old: num [1:12] 300 280 344 385 372 360 288 321 376 290 ...
##  $ New: num [1:12] 274 220 308 336 198 300 315 258 318 310 ...
# #Store a Grouped Tibble
ii <- bb %>% 
  pivot_longer(everything(), names_to = "key", values_to = "value") %>% 
  group_by(key) 
str(ii)
## grouped_df [24 x 2] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ key  : chr [1:24] "Old" "New" "Old" "New" ...
##  $ value: num [1:24] 300 274 280 220 344 308 385 336 372 198 ...
##  - attr(*, "groups")= tibble [2 x 2] (S3: tbl_df/tbl/data.frame)
##   ..$ key  : chr [1:2] "New" "Old"
##   ..$ .rows: list<int> [1:2] 
##   .. ..$ : int [1:12] 2 4 6 8 10 12 14 16 18 20 ...
##   .. ..$ : int [1:12] 1 3 5 7 9 11 13 15 17 19 ...
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE
ii %>% summarise(across(value, list(N = ~n(), Count = length, Mean = mean, SD = sd), 
                        .names = "{.fn}"))
## # A tibble: 2 x 5
##   key       N Count  Mean    SD
##   <chr> <int> <int> <dbl> <dbl>
## 1 New      12    12   286  44.0
## 2 Old      12    12   325  40.0
# 
# #Equivalent (except Column Headers)
ii %>% summarise(N = n(), Count = across(value, length), 
                 Mean = across(value, mean), SD = across(value, sd))
## # A tibble: 2 x 5
##   key       N Count$value Mean$value SD$value
##   <chr> <int>       <int>      <dbl>    <dbl>
## 1 New      12          12        286     44.0
## 2 Old      12          12        325     40.0

25.6 Exercises

“ForLater”

09

  • (1:) \({n}_1 = 20, {\overline{x}}_1 = 22.5, {\sigma}_1 = 2.5\)
  • (2:) \({n}_2 = 30, {\overline{x}}_2 = 20.1, {\sigma}_2 = 4.8\)
  • (25.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{2.5}^2}{20} + \frac{{4.8}^2}{30}} = 1.0395\)
    • sqrt(2.5^2/20 + 4.8^2/30) \(\#\mathcal{R}\)
  • point estimate of the difference between the two population means
    • (25.1) \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )} = 22.5 - 20.1 = 2.4\)
  • (25.8) \(\text{DOF} = 45\)
    • floor({2.5^2 / 20 + 4.8^2 / 30 }^2 / {{2.5^2 / 20}^2/{20-1} + {4.8^2 / 30}^2/{30-1}}) \(\#\mathcal{R}\)
  • \({\gamma = 0.95} \iff{\alpha} = 0.05 \to {{}^2\!t_{{\alpha}/2}} = {{}^2\!t_{0.025}} = 2.014\)
    • qt(p = 0.025, df = 45, lower.tail = FALSE) \(\#\mathcal{R}\)
  • (25.6) \(\text{MOE}_{\gamma =0.95} = 2.014 * 1.0395 = 2.094 \approx 2.1\)
  • (25.7) \(\text{Interval Estimate}_{\gamma} = 2.4 \pm 2.1\)
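The steps above can be combined into one runnable block (same exercise values; `est`, `se`, `dof`, and `moe` are names chosen here for illustration):

```r
# #Exercise 09: interval estimate of the difference between two means
n1 <- 20; x1 <- 22.5; s1 <- 2.5
n2 <- 30; x2 <- 20.1; s2 <- 4.8
est <- x1 - x2                                                  #(25.1) 2.4
se  <- sqrt(s1^2/n1 + s2^2/n2)                                  #(25.2) 1.0395
dof <- floor(se^4 / ((s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1)))  #(25.8) 45
moe <- qt(p = 0.025, df = dof, lower.tail = FALSE) * se         #(25.6) ~2.1
c(Estimate = est, MOE = moe)                                    #(25.7) 2.4 +/- 2.1
```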

10

25.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

  • (1:) \({n}_1 = 35, {\overline{x}}_1 = 13.6, {\sigma}_1 = 5.2\)
  • (2:) \({n}_2 = 40, {\overline{x}}_2 = 10.1, {\sigma}_2 = 8.5\)
  • (25.1) \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )} = 13.6 - 10.1 = 3.5\)
  • (25.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{5.2}^2}{35} + \frac{{8.5}^2}{40}} = 1.6059\)
    • sqrt(5.2^2/35 + 8.5^2/40) \(\#\mathcal{R}\)
  • (25.8) \(\text{DOF} = 65\)
    • floor({5.2^2 / 35 + 8.5^2 / 40 }^2 / {{5.2^2 / 35}^2/{35-1} + {8.5^2 / 40}^2/{40-1}}) \(\#\mathcal{R}\)
  • (25.9) \(t = \frac{(13.6 - 10.1) - 0}{1.6059} = 2.179\)
  • \({}^2\!P_{(t = 2.179)} = 2 * {}^U\!P_{(t = 2.179)} = 2 * 0.0165 = 0.033\)
    • 2 * pt(q = 2.179, df = 65, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^2\!P_{(t)} < {\alpha} \to {H_0}\) is rejected
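A runnable consolidation of this test (same values as above):

```r
# #Exercise 10: two tail test about the difference between two means
n1 <- 35; x1 <- 13.6; s1 <- 5.2
n2 <- 40; x2 <- 10.1; s2 <- 8.5
se  <- sqrt(s1^2/n1 + s2^2/n2)                                  #(25.2) 1.6059
dof <- floor(se^4 / ((s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1)))  #(25.8) 65
tstat <- (x1 - x2 - 0) / se                                     #(25.9) 2.179
p2 <- 2 * pt(q = tstat, df = dof, lower.tail = FALSE)           #two tail p-value ~0.033
p2 < 0.05                                                       #H0 is rejected
```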

25.7 Matched Samples (Paired)

Suppose employees at a manufacturing company can use two different methods to perform a production task. To maximize production output, the company wants to identify the method with the smaller population mean completion time. We can use two alternative designs for the sampling procedure.

Definition 25.6 Independent sample design: A simple random sample of workers is selected and each worker in the sample uses method 1. A second independent simple random sample of workers is selected and each worker in this sample uses method 2.
Definition 25.7 Matched sample design: One simple random sample of workers is selected. Each worker first uses one method and then uses the other method. The order of the two methods is assigned randomly to the workers, with some workers performing method 1 first and others performing method 2 first. Each worker provides a pair of data values, one value for method 1 and another value for method 2.

In the matched sample design the two production methods are tested under similar conditions (i.e., with the same workers); hence this design often leads to a smaller sampling error than the independent sample design.

The primary reason is that in a matched sample design, variation between workers is eliminated because the same workers are used for both production methods.

The key to the analysis of the matched sample design is to realize that we consider only the column of differences.

The matched sample design is generally preferred to the independent sample design because the matched-sample procedure often improves the precision of the estimate.

Let \({\mu}_d\) = the mean of the difference in values for the population

Definition 25.8 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_d = 0 \iff {H_a}: {\mu}_d \neq 0\)

Sample Mean is given by (25.12) like (18.6) and Sample Standard Deviation is given by (25.13) like (18.12)

\[\overline{d} = \frac{\sum{{d}_i}}{n} \tag{25.12}\]

\[{s}_d = \sqrt{\frac{\sum ({d}_i - \overline{d})^2}{n-1}} \tag{25.13}\]

Test Statistic with \((n-1)\) degrees of freedom : Refer (25.14) like (24.3)

\[t = \frac{\overline{d} - {\mu}_d}{{s}_d/\sqrt{n}} \tag{25.14}\]

Margin of Error (\(\text{MOE}_{{\gamma}}\)) : Refer (25.15) like (25.3)

\[\text{MOE}_{{\gamma}} = {t}_{\frac{{\alpha}}{2}}\frac{{s}_d}{\sqrt{n}} \tag{25.15}\]

\(\text{Interval Estimate}_{\gamma}\) : Refer (25.16) like (25.4)

\[\text{Interval Estimate}_{\gamma} = \overline{d} \pm {t}_{\frac{{\alpha}}{2}}\frac{{s}_d}{\sqrt{n}} \tag{25.16}\]

“ForLater” - Exercise

Example

Example: Two Production Methods

  • (1: Method 1) \(\{6, 5, 7, 6.2, 6, 6.4\}\)
  • (2: Method 2) \(\{5.4, 5.2, 6.5, 5.9, 6, 5.8\}\)
  • Difference: \({d}_i =\{0.6, -0.2, 0.5, 0.3, 0, 0.6\}\)
  • \({n} = 6, {\overline{d}} = 0.3, {{s}_d} = 0.335, \text{DOF} = 5, \text{SE} = {{s}_d}/\sqrt{n} = 0.1366\)
  • (25.14) \(t = \frac{\overline{d} - {\mu}_d}{{s}_d/\sqrt{n}} = \frac{0.30 - 0}{0.335/\sqrt{6}} = 2.196\)
    • {mean(bb$d) - 0 } / {sd(bb$d) / sqrt(length(bb$d))} \(\#\mathcal{R}\)
  • Calculate \({}^2\!P_{(t)}\)
    • \({}^2\!P_{(t = 2.196)} = 2 * {}^U\!P_{(t = 2.196)} = 2 * 0.0397 = 0.07949\)
      • 2 * pt(q = 2.196, df = 5, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^2\!P_{(z)} > {\alpha} \to {H_0}\) cannot be rejected
  • 95% Confidence Interval can also be estimated
    • \({\gamma = 0.95} \iff{\alpha} = 0.05 \to {{}^2\!t_{{\alpha}/2}} = {{}^2\!t_{0.025}} = 2.57\)
    • qt(p = 0.025, df = 5, lower.tail = FALSE) \(\#\mathcal{R}\)
  • (25.15) \(\text{MOE}_{\gamma =0.95} = 2.57 * 0.1366 = 0.35\)
  • (25.16) \(\text{Interval Estimate}_{\gamma} = 0.3 \pm 0.35\)

Code

# #Matched Samples: Same workers providing data for two methods
xxMatchedMethods <- tibble(M1 = c(6, 5, 7, 6.2, 6, 6.4), 
                           M2 = c(5.4, 5.2, 6.5, 5.9, 6, 5.8))
aa <- xxMatchedMethods
# #Get Difference
bb <- aa %>% mutate(d = M1-M2)
str(bb)
## tibble [6 x 3] (S3: tbl_df/tbl/data.frame)
##  $ M1: num [1:6] 6 5 7 6.2 6 6.4
##  $ M2: num [1:6] 5.4 5.2 6.5 5.9 6 5.8
##  $ d : num [1:6] 0.6 -0.2 0.5 0.3 0 ...
paste0(round(bb[3], 1))
## [1] "c(0.6, -0.2, 0.5, 0.3, 0, 0.6)"
#
cat(paste0("- ${n} = ", length(bb$d), ", {\\overline{d}} = ", round(mean(bb$d), 1), 
           ", {{s}_d} = ", round(sd(bb$d), 3), "$\n"))
## - ${n} = 6, {\overline{d}} = 0.3, {{s}_d} = 0.335$
cat(paste0("t = ", round({mean(bb$d) - 0 } / {sd(bb$d) / sqrt(length(bb$d))}, 3), "\n"))
## t = 2.196
#
# #Paired t-test
bb <- aa %>% pivot_longer(everything(), names_to = "key", values_to = "value")
#
# #Welch Two Sample t-test
# #Alternative must be: "two.sided" (Default), "less", "greater"
bb_ha <- "two.sided"
bb_testT <- t.test(formula = value ~ key, data = bb, alternative = bb_ha, paired = TRUE)
bb_testT
## 
##  Paired t-test
## 
## data:  value by key
## t = 2.1958, df = 5, p-value = 0.07952
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.05120834  0.65120834
## sample estimates:
## mean of the differences 
##                     0.3

25.8 Population Proportions

To make an inference about the difference between the two population proportions \(({p}_1 - {p}_2)\), we select a simple random sample of \({n}_1\) units from population 1 and a second simple random sample of \({n}_2\) units from population 2. Let \(\overline{{p}}_1, \overline{{p}}_2\) denote the sample proportions for the simple random samples from populations 1 and 2.

\(({x})\) denotes Count of Success

Interval Estimation of \(({p}_1 - {p}_2)\)

The point estimator of the difference between two population proportions is the difference between the sample proportions of two independent simple random samples.

Point Estimate : Refer (25.17) like (25.1)

\[E_{( {\overline{p}}_1 - {\overline{p}}_2 )} = {\overline{p}}_1 - {\overline{p}}_2 \tag{25.17}\]

As with other point estimators, the point estimator \(E_{( {\overline{p}}_1 - {\overline{p}}_2 )}\) has a standard error \({\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)}\) that describes the variation in the sampling distribution of the estimator.

The two population proportions, \(({p}_1, {p}_2)\), are unknown. Thus, sample proportions \(({\overline{p}}_1, {\overline{p}}_2)\) are being used to estimate them.

Standard Error of \(({\overline{p}}_1 - {\overline{p}}_2)\) : Refer equation (25.18) like (25.2)

\[\begin{align} {\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} &= \sqrt{\frac{{p}_1 (1-{p}_1)}{{n}_1} + \frac{{p}_2 (1-{p}_2)}{{n}_2}} \\ &= \sqrt{\frac{{\overline{p}}_1 (1-{\overline{p}}_1)}{{n}_1} + \frac{{\overline{p}}_2 (1-{\overline{p}}_2)}{{n}_2}} \end{align} \tag{25.18}\]

Margin of Error (\(\text{MOE}_{{\gamma}}\)) : Refer equation (25.19) like (25.3)

\[\text{MOE}_{{\gamma}} = {z}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} = {z}_{\frac{{\alpha}}{2}}\sqrt{\frac{{\overline{p}}_1 (1-{\overline{p}}_1)}{{n}_1} + \frac{{\overline{p}}_2 (1-{\overline{p}}_2)}{{n}_2}} \tag{25.19}\]

\(\text{Interval Estimate}_{\gamma}\) : Refer equation (25.20) like (25.4)

\[\text{Interval Estimate}_{\gamma} = ({\overline{p}}_1 - {\overline{p}}_2) \pm {z}_{\frac{{\alpha}}{2}}\sqrt{\frac{{\overline{p}}_1 (1-{\overline{p}}_1)}{{n}_1} + \frac{{\overline{p}}_2 (1-{\overline{p}}_2)}{{n}_2}} \tag{25.20}\]

Example: Tax Preparation: (Count of Success \(({x})\) is Number of Returns with Errors)

  • (1: Office 1) \(\{{n}_1 = 250, {x}_1 = 35\} \to {\overline{p}}_1 = {x}_1/{n}_1 = 0.14\)
  • (2: Office 2) \(\{{n}_2 = 300, {x}_2 = 27\} \to {\overline{p}}_2 = {x}_2/{n}_2 = 0.09\)
  • point estimate
    • (25.17) \(E_{( {\overline{p}}_1 - {\overline{p}}_2 )} = 0.14 - 0.09 = 0.05\)
      • Thus, we estimate that Office 1 has a .05, or 5%, greater error rate than Office 2.
  • For \({\gamma = 0.90} \iff{\alpha} = 0.10 \to {z_{{\alpha}/2}} = {z_{0.05}} = 1.645\)
    • qnorm(p = 0.05, lower.tail = FALSE) \(\#\mathcal{R}\)
  • (25.18) \({\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} = 0.0275\)
    • sqrt(0.14 * (1 - 0.14) / 250 + 0.09 * (1 - 0.09) / 300) \(\#\mathcal{R}\)
  • (25.19) \(\text{MOE}_{\gamma} = {z}_{\frac{{\alpha}}{2}} {\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} = 1.645 * 0.0275 = 0.045\)
  • (25.20) \(\text{Interval Estimate}_{\gamma} = ({\overline{p}}_1 - {\overline{p}}_2) \pm \text{MOE}_{\gamma} = 0.05 \pm 0.045\)
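The interval computation can be reproduced in R (same numbers as above). The result matches the `conf.int` that `prop.test(..., conf.level = 0.90, correct = FALSE)` reports in the Code section of this chapter:

```r
# #90% interval estimate of (p1 - p2), tax preparation example
n1 <- 250; x1 <- 35; p1 <- x1 / n1   #0.14
n2 <- 300; x2 <- 27; p2 <- x2 / n2   #0.09
est <- p1 - p2                                          #(25.17) 0.05
se  <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)    #(25.18) 0.0275
moe <- qnorm(p = 0.05, lower.tail = FALSE) * se         #(25.19) 0.045
c(Lower = est - moe, Upper = est + moe)                 #(25.20)
```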

25.8.1 Hypothesis

Definition 25.9 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {p}_1 - {p}_2 \geq 0 \iff {H_a}: {p}_1 - {p}_2 < 0\)
Definition 25.10 \(\text{\{Right or Upper\} } {H_0} : {p}_1 - {p}_2 \leq 0 \iff {H_a}: {p}_1 - {p}_2 > 0\)
Definition 25.11 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {p}_1 - {p}_2 = 0 \iff {H_a}: {p}_1 - {p}_2 \neq 0\)

When we assume \({H_0}\) is true as an equality, we have \({p}_1 - {p}_2 = 0\), which is the same as saying that the population proportions are equal, \({p}_1 = {p}_2 = {p}\). The equation (25.18) becomes (25.21)

Standard Error of \(({\overline{p}}_1 - {\overline{p}}_2)\) : When \({p}_1 = {p}_2 = {p}\)

\[\begin{align} {\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} &= \sqrt{{p} (1-{p})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)} \\ &= \sqrt{{\overline{p}} (1-{\overline{p}})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)} \end{align} \tag{25.21}\]

With \({p}\) unknown, we pool, or combine, the point estimators from the two samples \(({\overline{p}}_1, {\overline{p}}_2)\) to obtain a single point estimator of \({p}\), given by (25.22)

Pooled Estimator of \({p}\) : When \({p}_1 = {p}_2 = {p}\)

\[{\overline{p}} = \frac{{n}_1 {\overline{p}}_1 + {n}_2 {\overline{p}}_2}{{n}_1 + {n}_2} \tag{25.22}\]

Test Statistic for Hypothesis Tests :

\[z = \frac{({\overline{p}}_1 - {\overline{p}}_2)}{{\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)}} = \frac{({\overline{p}}_1 - {\overline{p}}_2)}{\sqrt{{\overline{p}} (1-{\overline{p}})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)}} \tag{25.23}\]

Example: Tax Preparation: Continuation

25.11 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {p}_1 - {p}_2 = 0 \iff {H_a}: {p}_1 - {p}_2 \neq 0\)

  • (25.22) \({\overline{p}} = \frac{{n}_1 {\overline{p}}_1+ {n}_2 {\overline{p}}_2}{{n}_1 + {n}_2} = \frac{250 * 0.14 + 300 * 0.09}{250 + 300} = 0.1127\)
    • {250 * 0.14 + 300 * 0.09} / {250 + 300} \(\#\mathcal{R}\)
  • (25.23) \(z = \frac{({\overline{p}}_1 - {\overline{p}}_2)}{{\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)}} = \frac{({\overline{p}}_1 - {\overline{p}}_2)}{\sqrt{{\overline{p}} (1-{\overline{p}})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)}}\)
    • \(z = \frac{(0.14 - 0.09)}{\sqrt{0.1127 (1-0.1127)\left(\frac{1}{250} + \frac{1}{300}\right)}} = \frac{(0.14 - 0.09)}{0.0271} = 1.845\)
    • NOTE: The denominator calculated in this manner \((0.0271)\) is very close to the originally calculated \({\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} = 0.0275\)
  • Calculate \({}^2\!P_{(z)}\)
    • \({}^2\!P_{(z = 1.845)} = 2 * {}^U\!P_{(z = 1.845)} = 2 * 0.0325 = 0.065\)
      • 2 * pnorm(q = 1.845, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.10\)
    • \({}^2\!P_{(z)} < {\alpha} \to {H_0}\) is rejected i.e. the proportions are different
    • The firm can conclude that the error rates differ between the two offices.
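A direct computation of (25.22) and (25.23) in R (same figures as above); the squared z value also explains the X-squared statistic that `prop.test()` reports in the Code section:

```r
# #Pooled two-proportions z-test (25.22, 25.23), tax preparation example
n1 <- 250; p1 <- 0.14
n2 <- 300; p2 <- 0.09
pbar <- (n1 * p1 + n2 * p2) / (n1 + n2)                     #(25.22) 0.1127
z <- (p1 - p2) / sqrt(pbar * (1 - pbar) * (1/n1 + 1/n2))    #(25.23) ~1.846
pz <- 2 * pnorm(q = z, lower.tail = FALSE)                  #two tail p-value ~0.065
c(z = z, z_squared = z^2, p.value = pz)   #z^2 equals prop.test's X-squared (3.4084)
```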

Code

(External) Two-Proportions Z-Test

bb_ha <- "two.sided"
bb_gamma <- 0.90
bb_propT <- prop.test(x = c(35, 27), n = c(250, 300), alternative = bb_ha, 
                      conf.level = bb_gamma, correct = FALSE)
bb_propT
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(35, 27) out of c(250, 300)
## X-squared = 3.4084, df = 1, p-value = 0.06486
## alternative hypothesis: two.sided
## 90 percent confidence interval:
##  0.004815898 0.095184102
## sample estimates:
## prop 1 prop 2 
##   0.14   0.09
names(bb_propT)
## [1] "statistic"   "parameter"   "p.value"     "estimate"    "null.value"  "conf.int"   
## [7] "alternative" "method"      "data.name"
#
# #X-squared is the square of the calculated z value (1.8462^2 = 3.4084)
cat(paste0("X-squared is the chi-squared test statistic (X-squared = ", round(bb_propT$statistic, 6), ")\n"))
## X-squared is the chi-squared test statistic (X-squared = 3.408415)
#
# #By default, the function prop.test() uses the Yates continuity correction
# #It is important if either the expected successes or failures is < 5. 
# #If you do not want the correction, use the additional argument correct = FALSE. 
# #i.e. To make the test mathematically equivalent to the uncorrected z-test of a proportion.
cat(paste0("p-value is the significance level of the test (p-value = ", 
           round(bb_propT$p.value, 6), ")\n"))
## p-value is the significance level of the test (p-value = 0.064865)
cat(paste0("conf.int is the confidence interval for the probability of success at ", bb_gamma,
           " level (conf.int = [", paste0(round(bb_propT$conf.int, 3), collapse = ", "), "])\n"))
## conf.int is the confidence interval for the probability of success at 0.9 level (conf.int = [0.005, 0.095])
cat(paste0("sample estimates is the estimated probability of success. p: ", 
           paste0(round(bb_propT$estimate, 2), collapse = ", "), "\n"))
## sample estimates is the estimated probability of success. p: 0.14, 0.09

Validation


26 Variance

26.1 Overview

  • “Inferences About Population Variances”
    • “ForLater” - Hypothesis Testing, Inferences About Two Population Variances

26.2 Inferences About a Population Variance

In many manufacturing applications, controlling the process variance is extremely important in maintaining quality.

The sample variance \({s^2}\), given by equation (26.1), is the point estimator of the population variance \({\sigma}^2\).

\[{s^2} = \frac{\sum {({x}_i - {\overline{x}})}^2}{n-1} \tag{26.1}\]

Definition 26.1 Whenever a simple random sample of size \({n}\) is selected from a normal population, the sampling distribution of \(\frac{(n-1)s^2}{{\sigma}^2}\) is a chi-square distribution with \({n − 1}\) degrees of freedom.

Note:

  • The chi-square distribution is based on sampling from a normal population.
  • It can be used to develop interval estimates and conduct hypothesis tests about a population variance.
  • The notation \({\chi_{\alpha}^2}\) denotes the value for the chi-square distribution that provides an area or probability of \({\alpha}\) to the right of the \({\chi_{\alpha}^2}\) value.
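Definition 26.1 can be checked with a small simulation. This is a sketch with assumed population parameters (Normal with mean 10, sd 0.05, and sample size n = 20, chosen only for illustration): the empirical quantiles of \((n-1)s^2/{\sigma}^2\) should track the chi-square distribution with 19 degrees of freedom.

```r
# #Simulate the sampling distribution of (n-1)s^2 / sigma^2 (Definition 26.1)
# #Assumed population: Normal(mean = 10, sd = 0.05); sample size n = 20
set.seed(3)
n <- 20; sigma <- 0.05
stat <- replicate(10000, (n - 1) * var(rnorm(n, mean = 10, sd = sigma)) / sigma^2)
# #Empirical 97.5th percentile should be close to the theoretical chi-square value
c(Empirical = unname(quantile(stat, 0.975)), Theoretical = qchisq(p = 0.975, df = n - 1))
```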

26.2.1 Interval Estimation

Example: A sample of 20 containers \({n = 20}\) has the sample variance \({s^2} = 0.0025\)

  • \({\chi_{\alpha = 0.025}^2} = 32.852\)
    • qchisq(p = 0.025, df = 19, lower.tail = FALSE) \(\#\mathcal{R}\)
    • It indicates that 2.5% of the chi-square values are to the right of 32.852
    • Also, \({\chi_{\alpha = 0.975}^2} = 8.907\) indicates that 97.5% of the chi-square values are to the right of 8.907.
      • qchisq(p = 0.975, df = 19, lower.tail = FALSE) \(\#\mathcal{R}\)
    • Thus, 95% of the chi-square values are between \({\chi_{\alpha = 0.975}^2}\) and \({\chi_{\alpha = 0.025}^2}\)
    • There is a .95 probability of obtaining a \({\chi^2}\) value such that \({\chi_{0.975}^2} \leq {\chi^2} \leq{\chi_{0.025}^2}\)
# #pnorm() qnorm() | pt() qt() | pchisq() qchisq() | pf() qf() 
#
# #p-value approach: Find Cumulative Probability P corresponding to the given ChiSq & DOF=19
pchisq(q = 32.852, df = 19, lower.tail = FALSE)
## [1] 0.02500216
#
# #ChiSq value for which Area under the curve towards Right is alpha=0.025 & DOF=19 #32.852
qchisq(p = 0.025, df = 19, lower.tail = FALSE)
## [1] 32.85233
  • Rearranging the probability statement above, we get (26.2), which provides a 95% confidence interval estimate for the population variance \({\sigma}^2\).

\[\frac{(n-1)s^2}{{\chi_{0.025}^2}} \leq {\sigma}^2 \leq \frac{(n-1)s^2}{{\chi_{0.975}^2}} \tag{26.2}\]

  • In the example, \((n-1)s^2 = 19 * 0.0025 = 0.0475\)
  • (26.2) \(\frac{0.0475}{32.852} \leq {\sigma}^2 \leq \frac{0.0475}{8.907} \to 0.0014 \leq {\sigma}^2 \leq 0.0053 \to 0.0380 \leq {\sigma} \leq 0.0730\)
    • which gives the 95% confidence interval for the population standard deviation

Generalising the equation (26.2), the equation (26.3) is the interval estimate of a population variance.

\[\frac{(n-1)s^2}{{\chi_{{\alpha}/2}^2}} \leq {\sigma}^2 \leq \frac{(n-1)s^2}{{\chi_{1-{\alpha}/2}^2}} \tag{26.3}\]

where the \({\chi^2}\) values are based on a chi-square distribution with \({n-1}\) degrees of freedom and where \((1 − {\alpha})\) is the confidence coefficient.
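Equation (26.3) can be applied in R; here with the container example values (n = 20, s² = 0.0025, α = 0.05):

```r
# #Interval estimate of a population variance (26.3), container example
n <- 20; s2 <- 0.0025; alpha <- 0.05
lower <- (n - 1) * s2 / qchisq(p = alpha / 2,     df = n - 1, lower.tail = FALSE)
upper <- (n - 1) * s2 / qchisq(p = 1 - alpha / 2, df = n - 1, lower.tail = FALSE)
c(VarLower = lower, VarUpper = upper, SDLower = sqrt(lower), SDUpper = sqrt(upper))
```

This reproduces \(0.0014 \leq {\sigma}^2 \leq 0.0053\) and \(0.038 \leq {\sigma} \leq 0.073\) from the example.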

26.2.2 Hypothesis Tests

Using \({{\sigma}_0^2}\) to denote the hypothesized value for the population variance, the three forms for a hypothesis test are as follows:

Definition 26.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\sigma}^2 \geq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 < {{\sigma}_0^2}\)
Definition 26.3 \(\text{\{Right or Upper\} } {H_0} : {\sigma}^2 \leq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 > {{\sigma}_0^2}\)
Definition 26.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}^2 = {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 \neq {{\sigma}_0^2}\)

Note: In general, the upper tail test is the most frequently used form because low variances are generally desirable. With a statement about the maximum allowable population variance, we can test the null hypothesis that the population variance is less than or equal to the maximum allowable value against the alternative hypothesis that it is greater. With this test structure, corrective action is taken whenever rejection of the null hypothesis indicates an excessive population variance.

Test Statistic for Hypothesis Tests About a Population Variance: Refer (26.4), where \({\chi^2}\) has a chi-square distribution with \({n - 1}\) degrees of freedom.

\[{\chi^2} = \frac{(n - 1){s}^2}{{\sigma}_0^2} \tag{26.4}\]

Example: Louis: the company standard specifies an arrival time variance of 4 or less

26.3 \(\text{\{Right or Upper\} } {H_0} : {\sigma}^2 \leq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 > {{\sigma}_0^2}\)

  • (Sample) \({n} = 24, {s}^2 = 4.9\)
  • (26.4) \({\chi^2} = \frac{(n - 1){s}^2}{{\sigma}_0^2} = \frac{(24 - 1) * 4.9}{4} = 28.18\)
  • \({}^U\!P_{(\chi^2)} = 0.209\)
    • pchisq(q = 28.18, df = 23, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(\chi^2)} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude that the variance exceeds the company standard.
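The test can be run as a short R block (same values):

```r
# #Upper tail test about a population variance (26.4), Louis example
n <- 24; s2 <- 4.9; sigma0_2 <- 4; alpha <- 0.05
chisq_stat <- (n - 1) * s2 / sigma0_2                         #(26.4) 28.18
p <- pchisq(q = chisq_stat, df = n - 1, lower.tail = FALSE)   #~0.209
p > alpha                                                     #H0 cannot be rejected
```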

Example: bureau of motor vehicles: Evaluate the variance in the new examination test scores with the historical value \({\sigma}_0^2 = 100\)

26.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}^2 = {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 \neq {{\sigma}_0^2}\)

  • (Sample) \({n} = 30, {s}^2 = 162\)
  • (26.4) \({\chi^2} = \frac{(n - 1){s}^2}{{\sigma}_0^2} = \frac{(30 - 1) * 162}{100} = 46.98\)
  • \({}^2\!P_{(\chi^2)} = 2 * {}^U\!P_{(\chi^2)} = 2 * 0.0187 = 0.0374\)
    • 2 * pchisq(q = 46.98, df = 29, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^2\!P_{(\chi^2)} < {\alpha} \to {H_0}\) is rejected i.e. the change is significant
    • The new examination test scores have a population variance different from the historical variance

26.3 Inferences About Two Population Variances

The two sample variances \({s}_1^2\) and \({s}_2^2\) will be the basis for making inferences about the two population variances \({\sigma}_1^2\) and \({\sigma}_2^2\).

Definition 26.5 Whenever independent simple random samples of sizes \({n}_1\) and \({n}_2\) are selected from two normal populations with equal variances \(({\sigma}_1^2 = {\sigma}_2^2)\), the sampling distribution of \(\frac{{s}_1^2}{{s}_2^2}\) is an F distribution with \(({n}_1 - 1)\) degrees of freedom for the numerator and \(({n}_2 - 1)\) degrees of freedom for the denominator.

Note:

  • The F distribution is based on sampling from two normal populations.
  • The F distribution is not symmetric, and the F values can never be negative.
    • The shape of any particular F distribution depends on its numerator and denominator degrees of freedom.
  • We refer to the population providing the larger sample variance as population 1.
    • Because the F test statistic is constructed with the larger sample variance \({s}_1^2\) in the numerator, the value of the test statistic will be in the upper tail of the F distribution.

Test Statistic for Hypothesis Tests About Population Variances with \(({\sigma}_1^2 = {\sigma}_2^2)\) : Refer equation (26.5)

\[F = \frac{{s}_1^2}{{s}_2^2} \tag{26.5}\]

Hypothesis Tests :

Definition 26.6 \(\text{\{Left or Lower \} }\space\thinspace \text{Do not do this.}\)
Definition 26.7 \(\text{\{Right or Upper\} } {H_0} : {\sigma}_1^2 \leq {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 > {\sigma}_2^2\)
Definition 26.8 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}_1^2 = {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 \neq {\sigma}_2^2\)

Example: Dullus County Schools:

26.8 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}_1^2 = {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 \neq {\sigma}_2^2\)

  • (1: Milbank) \({n} = 26, {s}_1^2 = 48\)
  • (2: Gulf) \({n} = 16, {s}_2^2 = 20\)
  • (26.5) \(F = \frac{{s}_1^2}{{s}_2^2} = \frac{48}{20} = 2.4\)
  • \({}^2\!P_{(F)} = 2 * {}^U\!P_{(F)} = 2 * 0.0406 = 0.0812\)
    • 2 * pf(q = 2.4, df1 = 25, df2 = 15, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.10\)
    • \({}^2\!P_{(F)} < {\alpha} \to {H_0}\) is rejected, i.e., the two populations have different variances
    • The sample results provide sufficient evidence to conclude that the variances are different.
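The steps above can be collected into a short R sketch:

```r
# Two-tailed F test for equal variances (Dullus County Schools example)
s1_sq <- 48; n1 <- 26   # Milbank (larger sample variance goes in the numerator)
s2_sq <- 20; n2 <- 16   # Gulf
F_stat <- s1_sq / s2_sq
# Two-tailed p-value: twice the upper-tail area beyond the observed F
p_two <- 2 * pf(F_stat, df1 = n1 - 1, df2 = n2 - 1, lower.tail = FALSE)
c(F = F_stat, p_value = p_two)
```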

Example: public opinion survey: do women show a greater variation in attitude on political issues than men

26.7 \(\text{\{Right or Upper\} } {H_0} : {\sigma}_1^2 \leq {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 > {\sigma}_2^2\)

  • (1: Women) \({n}_1 = 41, {s}_1^2 = 120\)
  • (2: Men) \({n}_2 = 31, {s}_2^2 = 80\)
  • (26.5) \(F = \frac{{s}_1^2}{{s}_2^2} = \frac{120}{80} = 1.5\)
  • \({}^U\!P_{(F)} = 0.1256\)
    • pf(q = 1.5, df1 = 40, df2 = 30, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(F)} > {\alpha} \to {H_0}\) cannot be rejected, i.e., there is no evidence that the two population variances differ
    • The sample results do not provide sufficient evidence to conclude that women have higher variance in political opinion than men
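Again as a short R sketch:

```r
# Upper-tail F test: do women show greater variance than men?
s1_sq <- 120; n1 <- 41   # Women
s2_sq <- 80;  n2 <- 31   # Men
F_stat <- s1_sq / s2_sq
# Upper-tail p-value for the one-sided alternative sigma1^2 > sigma2^2
p_upper <- pf(F_stat, df1 = n1 - 1, df2 = n2 - 1, lower.tail = FALSE)
c(F = F_stat, p_value = p_upper)
```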

Validation


27 Independence

27.1 Overview

  • “Comparing Multiple Proportions, Test of Independence and Goodness of Fit”
    • “ForLater” - Everything

27.2 Introduction

Hypothesis-testing procedures that expand our capacity for making statistical inferences about populations

  • The test statistic used in conducting the hypothesis tests in this chapter is based on the chi-square \({\chi^2}\) distribution.
  • In all cases, the data are categorical.
  • Applications
    • Testing the equality of population proportions for three or more populations
    • Testing the independence of two categorical variables
    • Testing whether a probability distribution for a population follows a specific historical or theoretical probability distribution

27.3 Testing the Equality of Population Proportions for Three or More Populations

Definition 27.1 \(\text{\{Equality of Population Proportions \}} {H_0} : {p}_1 = {p}_2 = \dots = {p}_k \iff {H_a}: \text{Not all population proportions are equal}\)

where \({p}_j\) is population proportion of the \(j^{\text{th}}\) population. We assume that a simple random sample of size \({n}_j\) has been selected from each of the \({k}\) populations or treatments.

Select a random sample from each of the populations and record the observed frequencies, \(f_{ij}\), in a table with 2 rows and k columns.

Expected Frequencies Under the Assumption \({H_0}\) is true : Refer equation (27.1)

\[e_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Total Sample Size}} \tag{27.1}\]

Chi-Square Test Statistic : Refer equation (27.2)

\[{\chi^2} = \sum_{i}{\sum_{j}{\frac{(f_{ij} - e_{ij})^2}{e_{ij}}}} \tag{27.2}\]

Where:

\[\begin{align} f_{ij} &= \text{observed frequency for the cell in row } i \text{ and column } j \\ e_{ij} &= \text{expected frequency for the cell in row } i \text{ and column } j \end{align}\]

Note: In a chi-square test involving the equality of \({k}\) population proportions, the above test statistic has a chi-square distribution with \({k - 1}\) degrees of freedom provided the expected frequency is 5 or more for each cell.

A chi-square test for equal population proportions will always be an upper tail test with rejection of \({H_0}\) occurring when the test statistic is in the upper tail of the chi-square distribution.

In studies such as these, we often use the same sample size for each population. We have chosen different sample sizes in this example to show that the chi-square test is not restricted to equal sample sizes for each of the k populations.

“ForLater” - Creating the ChiSq Table

Example: JD Power: Compare customer loyalty for three automobiles by using the proportion of owners likely to repurchase a particular automobile

  • Count of Success \(({x})\) is Number of Owners likely to repurchase
  • (1: Impala) \(\{{n}_1 = 125, {x}_1 = 69\}\)
  • (2: Fusion) \(\{{n}_2 = 200, {x}_2 = 120\}\)
  • (3: Accord) \(\{{n}_3 = 175, {x}_3 = 123\}\)
  • (27.2) \({\chi^2} = 7.89\)
  • \(P_{\chi^2} = 0.0193\)
    • pchisq(q = 7.89, df = 2, lower.tail = FALSE) \(\#\mathcal{R}\)
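The same result can be reached directly with chisq.test() on the 2 x k frequency table; the "not likely to repurchase" counts are reconstructed as sample size minus successes:

```r
# 2 x 3 table: rows = (likely, not likely to repurchase), columns = automobiles
likely     <- c(Impala = 69, Fusion = 120, Accord = 123)
not_likely <- c(125, 200, 175) - likely     # column totals minus successes
tab <- rbind(Likely = likely, NotLikely = not_likely)
res <- chisq.test(tab)   # no continuity correction is applied for a 2 x 3 table
res$expected             # expected frequencies under H0, equation (27.1)
c(statistic = unname(res$statistic), p_value = res$p.value)
```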

“ForLater” - Marascuilo procedure

27.4 Test of Independence

An important application of a chi-square test involves using sample data to test for the independence of two categorical variables. The null hypothesis for this test is that the two categorical variables are independent. Thus, the test is referred to as a test of independence.

Example: Beer: Preference vs. gender

  • Since an objective of the study is to determine whether beer preference differs between male and female beer drinkers, gender is treated as the explanatory variable and, following the usual practice, is made the column variable in the data tabulation table.
  • The beer preference is the categorical response variable and is shown as the row variable.
  • “ForLater”
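Since the worked table is deferred ("ForLater"), here is a minimal sketch with made-up counts showing how chisq.test() performs the test of independence:

```r
# Hypothetical counts (NOT from the text): beer preference by gender
tab <- matrix(c(20, 40, 20,    # Male:   Light, Regular, Dark
                30, 30, 10),   # Female: Light, Regular, Dark
              nrow = 3,
              dimnames = list(Preference = c("Light", "Regular", "Dark"),
                              Gender = c("Male", "Female")))
res <- chisq.test(tab)   # H0: preference and gender are independent
res$expected             # expected frequencies under independence
res$p.value              # small p-value -> evidence of association
```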

27.5 Goodness of Fit Test

  • “ForLater”
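This section is deferred ("ForLater"), but a minimal sketch of the multinomial goodness of fit test, with made-up observed counts and historical proportions, looks like this:

```r
# Hypothetical goodness-of-fit check (values NOT from the text):
# observed market shares vs. historical proportions
observed   <- c(A = 48, B = 98, C = 54)   # n = 200 customers
historical <- c(0.30, 0.50, 0.20)         # H0 proportions (must sum to 1)
res <- chisq.test(observed, p = historical)
res$statistic   # sum((f - e)^2 / e), with df = k - 1 = 2
res$p.value
```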

27.6 Summary

All tests apply to categorical variables and all tests use a chi-square \({\chi^2}\) test statistic that is based on the differences between observed frequencies and expected frequencies. In each case, expected frequencies are computed under the assumption that the null hypothesis is true. These chi-square tests are upper tailed tests. Large differences between observed and expected frequencies provide a large value for the chi-square test statistic and indicate that the null hypothesis should be rejected.

The test for the equality of population proportions for three or more populations is based on independent random samples selected from each of the populations. The sample data show the counts for each of two categorical responses for each population. The null hypothesis is that the population proportions are equal. Rejection of the null hypothesis supports the conclusion that the population proportions are not all equal.

The test of independence between two categorical variables uses one sample from a population with the data showing the counts for each combination of two categorical variables. The null hypothesis is that the two variables are independent and the test is referred to as a test of independence. If the null hypothesis is rejected, there is statistical evidence of an association or dependency between the two variables.

The goodness of fit test is used to test the hypothesis that a population has a specific historical or theoretical probability distribution. We showed applications for populations with a multinomial probability distribution and with a normal probability distribution. Since the normal probability distribution applies to continuous data, intervals of data values were established to create the categories for the categorical variable required for the goodness of fit test.

Validation


28 ANOVA

28.1 Overview

  • “Experimental Design and Analysis of Variance”
    • “ForLater” - Exercises, Fisher LSD Onwards

28.2 Introduction

Example: Chemitech: Comparison of three methods of assembly A, B, C in terms of most assemblies per week

  • In this experiment
    • assembly method is the independent variable or factor.
    • Because three assembly methods correspond to this factor, we say that three treatments are associated with this experiment; each treatment corresponds to one of the three assembly methods.
      • The three assembly methods or treatments define the three populations of interest
    • This is an example of a single-factor experiment; it involves one categorical factor (method of assembly)
    • For each population the dependent or response variable is the number of filtration systems assembled per week, and the primary statistical objective of the experiment is to determine whether the mean number of units produced per week is the same for all three populations (methods).
  • Suppose a random sample of three employees is selected from all assembly workers.
    • The three randomly selected workers are the experimental units.
    • The experimental design that we will use is called a completely randomized design. - This type of design requires that each of the three assembly methods or treatments be assigned randomly to one of the experimental units or workers.
Definition 28.1 Randomization is the process of assigning the treatments to the experimental units at random.
  • Suppose that instead of selecting just three workers at random we selected 15 workers and then randomly assigned each of the three treatments to 5 of the workers.
    • Because each method of assembly is assigned to 5 workers, we say that five replicates have been obtained.
  • As given in the data, the sample means for A, B, C are: \(\{{\overline{x}}_1 = 62, {\overline{x}}_2 = 66, {\overline{x}}_3 = 52\}\)
    • From these data, method B appears to result in higher production rates than either of the other methods.
xxChemitech <- tibble(A = c(58, 64, 55, 66, 67), 
                      B = c(58, 69, 71, 64, 68), 
                      C = c(48, 57, 59, 47, 49))
aa <- xxChemitech
# #Summary
aa %>% 
    pivot_longer(everything(), names_to = "key", values_to = "value") %>% 
    group_by(key) %>% 
    summarise(across(value, 
                     list(Count = length, Mean = mean, SD = sd, Variance = var), 
                     .names = "{.fn}"))
## # A tibble: 3 x 5
##   key   Count  Mean    SD Variance
##   <chr> <int> <dbl> <dbl>    <dbl>
## 1 A         5    62  5.24     27.5
## 2 B         5    66  5.15     26.5
## 3 C         5    52  5.57     31

28.3 Hypothesis

  • The real issue is whether the three sample means observed are different enough for us to conclude that the means of the populations corresponding to the three methods of assembly are different.
    • Let \(\{{\mu}_1, {\mu}_2, {\mu}_3\}\) denote mean number of units produced per week using methods A, B and C
    • we want to use the sample means to test the following hypotheses:

28.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)

If \({H_0}\) is rejected, we cannot conclude that all population means are different. Rejecting \({H_0}\) means that at least two population means have different values.

28.4 Assumptions for Analysis of Variance

Three assumptions are required to use analysis of variance.

  1. For each population, the response variable is normally distributed.
    • Implication: In the example, the number of units produced per week (response variable) must be normally distributed for each assembly method.
  2. The variance of the response variable, \({\sigma}^2\), is the same for all of the populations.
    • Implication: In the example, the variance of the number of units produced per week must be the same for each assembly method.
  3. The observations must be independent.
    • Implication: In the example, the number of units produced per week for each employee must be independent of the number of units produced per week for any other employee.

If the sample sizes are equal, analysis of variance is not sensitive to departures from the assumption of normally distributed populations.

Normality:

  • In ANOVA, the entire response column is typically nonnormal because the different groups in the data have different means.

28.5 Conceptual Overview

If the means for the three populations are equal, we would expect the three sample means to be close together. The more the sample means differ, the stronger the evidence we have for the conclusion that the population means differ. In other words, if the variability among the sample means is “small,” it supports \({H_0}\); if the variability among the sample means is “large,” it supports \({H_a}\).

If the null hypothesis is true, we can use the variability among the sample means to develop an estimate of \({\sigma}^2\).

First, note that if the assumptions for analysis of variance are satisfied and the null hypothesis is true, each sample will have come from the same normal distribution with mean \({\mu}\) and variance \({\sigma}^2\).

Recall that the sampling distribution of the sample mean \({\overline{x}}\) for a simple random sample of size \({n}\) from a normal population will be normally distributed with mean \({\mu}\) and variance \({\sigma}_{{\overline{x}}}^2 = \frac{{\sigma}^2}{n}\).

In this case, the mean and variance of the three sample mean values \(\{{\overline{x}}_1 = 62, {\overline{x}}_2 = 66, {\overline{x}}_3 = 52\}\) can be used to estimate the mean and variance of the sampling distribution.

When the sample sizes are equal, as in this example, the best estimate of the mean of the sampling distribution of \({\overline{x}}\) is the mean or average of the sample means.

In this example, an estimate of the mean of the sampling distribution of \({\overline{x}}\) is \((62 + 66 + 52)/3 = 60\). We refer to this estimate as the overall sample mean. Refer equation (28.5)

An estimate of the variance of the sampling distribution of \({\overline{x}}\), \({\sigma}_{{\overline{x}}}^2\), is provided by the variance of the three sample means.

\[{s}_{\overline{x}}^2 = \frac{(62 - 60)^2 + (66 - 60)^2 + (52 - 60)^2}{3 - 1} = 52\] Because \({\sigma}^2 = n {\sigma}_{{\overline{x}}}^2\), the estimate can be given by

\[E_{{\sigma}^2} = n E_{{\sigma}_{{\overline{x}}}^2} = n {s}_{\overline{x}}^2 = 5 * 52 = 260\]

The quantity \(n {s}_{\overline{x}}^2\) is referred to as the between-treatments estimate of \({\sigma}^2\). It is based on the assumption that the null hypothesis is true. In this case, each sample comes from the same population, and there is only one sampling distribution of \({\overline{x}}\).

In contrast, when the population means are not equal, the between-treatments estimate will overestimate the population variance \({\sigma}^2\).

The variation within each of the samples also has an effect on the conclusion we reach in analysis of variance. When a simple random sample is selected from each population, each of the sample variances provides an unbiased estimate of \({\sigma}^2\). Hence, we can combine or pool the individual estimates of \({\sigma}^2\) into one overall estimate. The estimate of \({\sigma}^2\) obtained in this way is called the pooled or within-treatments estimate of \({\sigma}^2\).

Because each sample variance provides an estimate of \({\sigma}^2\) based only on the variation within each sample, the within-treatments estimate of \({\sigma}^2\) is not affected by whether the population means are equal. When the sample sizes are equal, the within-treatments estimate of \({\sigma}^2\) can be obtained by computing the average of the individual sample variances \(\{27.5, 26.5, 31\}\).

For this example we obtain:

\[\text{Within-treatments estimate of } {\sigma}^2 = \frac{27.5 + 26.5 + 31}{3} = 28.33\] Remember that the between-treatments approach provides a good estimate of \({\sigma}^2\) only if the null hypothesis is true; if the null hypothesis is false, the between-treatments approach overestimates \({\sigma}^2\). The within-treatments approach provides a good estimate of \({\sigma}^2\) in either case.

Thus, if the null hypothesis is true, the two estimates will be similar and their ratio will be close to 1.

If the null hypothesis is false, the between-treatments estimate will be larger than the within-treatments estimate, and their ratio will be large.
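The two estimates described above can be reproduced in a few lines:

```r
# Between- and within-treatments estimates of sigma^2 (Chemitech example)
means <- c(A = 62, B = 66, C = 52)   # sample means, n = 5 per method
vars  <- c(27.5, 26.5, 31)           # sample variances
n <- 5
between <- n * var(means)            # n * s_xbar^2 = 5 * 52 = 260
within  <- mean(vars)                # pooled estimate (equal sample sizes)
c(between = between, within = within, ratio = between / within)
```

A ratio well above 1 is the informal signal that the population means differ; the F test below makes this comparison precise.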

28.6 ANOVA

Analysis of Variance and the Completely Randomized Design

Definition 28.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)

where \({\mu}_j\) is mean of the \(j^{\text{th}}\) population. We assume that a simple random sample of size \({n}_j\) has been selected from each of the \({k}\) populations or treatments. For the resulting sample data, let

\[\begin{align} {x}_{ij} &= \text{value of observation } i \text{ for treatment } j \\ {n}_{j} &= \text{number of observations for treatment } j \\ {\overline{x}}_{j} &= \text{sample mean for treatment } j \\ {s}_{j}^2 &= \text{sample variance for treatment } j \\ {s}_{j} &= \text{sample standard deviation for treatment } j \end{align}\]

The formulas for the sample mean and sample variance for treatment \({j}\) are given in equations (28.1) and (28.2)

\[{\overline{x}}_j = \frac{\sum_{i=1}^{n_j}{x}_{ij}}{{n}_j} \tag{28.1}\]

\[{s}_j^2 = \frac{\sum_{i=1}^{n_j}{\left({x}_{ij} - {\overline{x}}_j\right)^2}}{{n}_j - 1} \tag{28.2}\]

The overall sample mean, denoted \({\overline{\overline{x}}}\), is the sum of all the observations divided by the total number of observations.

\[{\bar{\bar{x}}} = \frac{\sum_{j=1}^k{\sum_{i=1}^{{n}_j}{{x}_{ij}}}}{{n}_T} \tag{28.3}\]

Where

\[{n}_T = {n}_1 + {n}_2 + \cdots + {n}_k \tag{28.4}\]

If the size of each sample is \({n}\), equation (28.4) becomes \({n}_T = kn\), and equation (28.3) reduces to (28.5)

\[{\bar{\bar{x}}} = \frac{\sum_{j=1}^k{{\overline{x}}_{j}}}{k} \tag{28.5}\]

Thus, whenever the sample sizes are the same, the overall sample mean is just the average of the \({k}\) sample means.

Thus, in the example, from (28.5), \({\bar{\bar{x}}} = \frac{62 + 66 + 52}{3} = 60\)

28.7 MSTR

Between-Treatments Estimate of Population Variance

The between-treatments estimate of \({\sigma}^2\) is called the mean square due to treatments and is denoted \(\text{MSTR}\). Refer equation (28.6)

\[\text{MSTR} = \frac{\text{SSTR}}{k - 1} = \frac{\sum_{j=1}^{k}{n}_j\left({\overline{x}}_j - {\bar{\bar{x}}} \right)^2}{k - 1} \tag{28.6}\]

The numerator in equation (28.6) is called the sum of squares due to treatments and is denoted \(\text{SSTR}\). The denominator, \({k − 1}\), represents the degrees of freedom associated with SSTR. Refer equation (28.7)

\[\text{SSTR} = \sum_{j=1}^{k}{n}_j\left({\overline{x}}_j - {\bar{\bar{x}}} \right)^2 \tag{28.7}\]

If \({H_0}\) is true, MSTR provides an unbiased estimate of \({\sigma}^2\). However, if the means of the \({k}\) populations are not equal, MSTR is not an unbiased estimate of \({\sigma}^2\); in fact, in that case, MSTR should overestimate \({\sigma}^2\).

In the example:

  • (28.7) \(\text{SSTR} = 5(62 - 60)^2 + 5(66 - 60)^2 + 5(52 - 60)^2 = 520\)
  • (28.6) \(\text{MSTR} = \frac{520}{3 - 1} = 260\)

28.8 MSE

Within-Treatments Estimate of Population Variance

The within-treatments estimate of \({\sigma}^2\) is called the mean square due to error and is denoted \(\text{MSE}\). Refer equation (28.8)

\[\text{MSE} = \frac{\text{SSE}}{{n}_T - k} = \frac{\sum_{j=1}^{k}{({n}_j - 1){s}_j^2}}{{n}_T - k} \tag{28.8}\]

The numerator in equation (28.8) is called the sum of squares due to error and is denoted \(\text{SSE}\). The denominator, \({{n}_T - k}\) is referred to as the degrees of freedom associated with SSE. Refer equation (28.9)

\[\text{SSE} = \sum_{j=1}^{k}{({n}_j - 1){s}_j^2} \tag{28.9}\]

In the example:

  • (28.9) \(\text{SSE} = (5 - 1)27.5 + (5 - 1)26.5 + (5 - 1)31 = 340\)
  • (28.8) \(\text{MSE} = \frac{340}{15 - 3} = 28.33\)
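Equations (28.6) through (28.9) can be verified directly from the raw Chemitech data:

```r
# MSTR and MSE computed from the raw data, equations (28.6)-(28.9)
x <- list(A = c(58, 64, 55, 66, 67),
          B = c(58, 69, 71, 64, 68),
          C = c(48, 57, 59, 47, 49))
k   <- length(x)                                 # number of treatments
n_j <- sapply(x, length)                         # observations per treatment
n_T <- sum(n_j)                                  # total sample size (28.4)
xbar_j  <- sapply(x, mean)                       # treatment means (28.1)
xbarbar <- sum(unlist(x)) / n_T                  # overall mean (28.3)
SSTR <- sum(n_j * (xbar_j - xbarbar)^2)          # (28.7)
MSTR <- SSTR / (k - 1)                           # (28.6)
SSE  <- sum((n_j - 1) * sapply(x, var))          # (28.9)
MSE  <- SSE / (n_T - k)                          # (28.8)
c(SSTR = SSTR, MSTR = MSTR, SSE = SSE, MSE = MSE)
```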

28.9 F test

Comparing the Variance Estimates

If the null hypothesis is true, \(\text{MSTR}\) and \(\text{MSE}\) provide two independent, unbiased estimates of \({\sigma}^2\).

Refer Variance.

We know that for normal populations, the sampling distribution of the ratio of two independent estimates of \({\sigma}^2\) follows an F distribution. Hence, if the null hypothesis is true and the ANOVA assumptions are valid, the sampling distribution of \(\frac{\text{MSTR}}{\text{MSE}}\) is an F distribution with numerator degrees of freedom equal to \({k - 1}\) and denominator degrees of freedom equal to \({{n}_T - k}\).

In other words, if the null hypothesis is true, the value of MSTR/MSE should appear to have been selected from this F distribution. However, if the null hypothesis is false, the value of \(\frac{\text{MSTR}}{\text{MSE}}\) will be inflated because MSTR overestimates \({\sigma}^2\). Hence, we will reject \({H_0}\) if the resulting value of \(\frac{\text{MSTR}}{\text{MSE}}\) appears to be too large to have been selected from an F distribution with \({k - 1}\) numerator degrees of freedom and \({{n}_T - k}\) denominator degrees of freedom.

Because the decision to reject \({H_0}\) is based on the value of \(\frac{\text{MSTR}}{\text{MSE}}\), the test statistic used to test for the equality of \({k}\) population means is given by equation (28.10)

Test Statistic for the Equality of \({k}\) Population Means :

\[F = \frac{\text{MSTR}}{\text{MSE}} \tag{28.10}\]

Because we will only reject the null hypothesis for large values of the test statistic, the p-value is the upper tail area of the F distribution to the right of the test statistic \({F}\).

  • aov()
    • Terms
      • Df degrees of freedom
        • for the independent variable (levels - 1)
        • and for the residuals (total observations - levels)
      • Sum Sq shows the sum of squares
        • SSTR: variation of the group means around the overall mean (between groups)
        • SSE: variation of the observations around their own group means (within groups)
      • Mean Sq shows the mean of the sum of squares
        • MSTR (Between): sum of squares / degrees of freedom for each parameter
        • MSE (Within): mean square of the residuals
      • F-value is the test statistic from the F test.
        • Mean square of each independent variable / mean square of the residuals.
        • The larger the F value, the more likely it is that the variation caused by the independent variable is real and not due to chance.
      • Pr(>F) is the p-value of the F-statistic.
        • likelihood that the F-value calculated from the test would have occurred if the null hypothesis of no difference among group means were true.
  • In the Example
    • Total sum of squares = Between (SSTR) + Within (SSE)
    • First Line (Column): \(\text{DOF}_{(k-1)} = 2, \text{SSTR} = 520, \text{MSTR} = 260\)
    • Residuals (Within) : \(\text{DOF}_{(n-k)} = 12, \text{SSE} = 340, \text{MSE} = 28.33\)
    • (28.10) \(F = \frac{\text{MSTR}}{\text{MSE}} = 9.18\)
  • Calculate \({}^U\!P_{(F)}\)
    • \({}^U\!P_{F = 9.18} = 0.0038\)
      • pf(q = 9.18, df1 = 2, df2 = 12, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(F)} < {\alpha} \to {H_0}\) is rejected i.e. the means are different
    • The test provides sufficient evidence to conclude that the means of the three populations are not equal.
    • Analysis of variance supports the conclusion that the population mean number of units produced per week for the three assembly methods are not equal

28.10 ANOVA Table

The sum of squares associated with the source of variation referred to as “Total” is called the total sum of squares (SST). SST divided by its degrees of freedom, \({n}_T - 1\), is nothing more than the overall sample variance that would be obtained if we treated the entire set of 15 observations as one data set. Refer equation (28.11)

\[\text{SST} = \text{SSTR} + \text{SSE} = \sum_{j=1}^k{\sum_{i=1}^{{n}_j}{\left( {x}_{ij} - \bar{\bar{x}}\right)^2}} \tag{28.11}\]

The degrees of freedom associated with this total sum of squares is the sum of the degrees of freedom associated with the sum of squares due to treatments and the sum of squares due to error i.e. \({n}_T - 1 = (k - 1) + ({n}_T - k)\).

In other words, SST can be partitioned into two sums of squares: the sum of squares due to treatments and the sum of squares due to error. Note also that the degrees of freedom corresponding to SST, \({n}_T - 1\), can be partitioned into the degrees of freedom corresponding to SSTR, \(k - 1\), and the degrees of freedom corresponding to SSE, \({n}_T - k\).

The analysis of variance can be viewed as the process of partitioning the total sum of squares and the degrees of freedom into their corresponding sources: treatments and error. Dividing the sum of squares by the appropriate degrees of freedom provides the variance estimates, the F value, and the p-value used to test the hypothesis of equal population means.

The square root of MSE provides the best estimate of the population standard deviation \({\sigma}\). This estimate of \({\sigma}\) on the computer output is Pooled StDev.
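The partition in equation (28.11) is easy to confirm numerically:

```r
# Partition check, equation (28.11): SST = SSTR + SSE for the Chemitech data
values <- c(58, 64, 55, 66, 67,   # A
            58, 69, 71, 64, 68,   # B
            48, 57, 59, 47, 49)   # C
SST <- sum((values - mean(values))^2)   # total sum of squares
SST                                     # equals SSTR + SSE = 520 + 340 = 860
```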

28.11 Code

str(aa)
## tibble [5 x 3] (S3: tbl_df/tbl/data.frame)
##  $ A: num [1:5] 58 64 55 66 67
##  $ B: num [1:5] 58 69 71 64 68
##  $ C: num [1:5] 48 57 59 47 49
bb <- aa %>% pivot_longer(everything(), names_to = "key", values_to = "value")
# 
# #ANOVA
ii_aov <- aov(formula = value ~ key, data = bb)
#names(ii_aov)
#ii_aov
#
# #
model.tables(ii_aov, type = "means")
## Tables of means
## Grand mean
##    
## 60 
## 
##  key 
## key
##  A  B  C 
## 62 66 52
#
# #Summary
summary(ii_aov)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## key          2    520  260.00   9.176 0.00382 **
## Residuals   12    340   28.33                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

28.12 Summary

Analysis of variance (ANOVA) can be used to test for differences among means of several populations or treatments.

The completely randomized design and the randomized block design are used to draw conclusions about differences in the means of a single factor. The primary purpose of blocking in the randomized block design is to remove extraneous sources of variation from the error term. Such blocking provides a better estimate of the true error variance and a better test to determine whether the population or treatment means of the factor differ significantly.

The basis for the statistical tests used in analysis of variance and experimental design is the development of two independent estimates of the population variance \({\sigma}^2\). In the single-factor case, one estimator is based on the variation between the treatments; this estimator provides an unbiased estimate of \({\sigma}^2\) only if the means \(\{{\mu}_1, {\mu}_2, \ldots, {\mu}_k\}\) are all equal. A second estimator of \({\sigma}^2\) is based on the variation of the observations within each sample; this estimator will always provide an unbiased estimate of \({\sigma}^2\).

By computing the ratio of these two estimators (the F statistic), it is determined whether to reject the null hypothesis that the population or treatment means are equal.

In all the experimental designs considered, the partitioning of the sum of squares and degrees of freedom into their various sources enabled us to compute the appropriate values for the analysis of variance calculations and tests.

Further, Fisher LSD procedure and the Bonferroni adjustment can be used to perform pairwise comparisons to determine which means are different.

Validation


29 Simple Linear Regression

29.1 Overview

    • “ForLater” - Everything

Simple Linear Regression Model

Definition 29.1 Regression analysis can be used to develop an equation showing how two or more variables are related.
Definition 29.2 The variable being predicted is called the dependent variable \(({y})\).
Definition 29.3 The variable or variables being used to predict the value of the dependent variable are called the independent variables \(({x})\).
Definition 29.4 The simplest type of regression analysis involving one independent variable and one dependent variable in which the relationship between the variables is approximated by a straight line, is called simple linear regression.
Definition 29.5 The equation that describes how \({y}\) is related to \({x}\) and an error term is called the regression model. For example, simple linear regression model is given by equation (29.1)

\[{y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon} \tag{29.1}\]

Note

  • \({\beta}_0\) and \({\beta}_1\) are referred to as the parameters of the model
  • The random variable, error term \(({\epsilon})\), accounts for the variability in \({y}\) that cannot be explained by the linear relationship between \({x}\) and \({y}\).
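A minimal sketch of fitting model (29.1) with lm(); the true parameter values and noise level below are arbitrary simulation choices:

```r
# Simulate y = beta0 + beta1 * x + epsilon and recover the parameters with lm()
set.seed(1)
beta0 <- 2; beta1 <- 3                       # arbitrary true parameters
x <- 1:50
y <- beta0 + beta1 * x + rnorm(50, sd = 2)   # epsilon ~ N(0, 2^2)
fit <- lm(y ~ x)
coef(fit)   # estimates of beta0 (intercept) and beta1 (slope)
```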

29.2 Summary

Validation


30 Multiple Regression

30.1 Overview

    • “ForLater” - Everything

30.2 Summary

Multiple regression analysis enables us to understand how a dependent variable is related to two or more independent variables.

Validation


31 Regression Models

31.1 Overview

    • “ForLater” - Everything

31.2 Summary

Validation


32 Time Series

32.1 Overview

  • “Time Series Analysis and Forecasting”
    • “ForLater” - Everything

32.2 Summary

Validation


33 Nonparametric Methods

33.1 Overview

33.2 Parametric Methods

Definition 33.1 Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).

Parametric methods mostly require quantitative data. They are, however, often more powerful than nonparametric methods.

  • The reason that parametric tests are sometimes more powerful than randomisation and tests based on ranks is that the parametric tests make use of some extra information about the data: the nature of the distribution from which the data are assumed to have come.
  • Powerful here means that they require a smaller sample size to detect the same effect.
  • However, their power advantage is not invariant
  • Further, rarely if ever do a parametric test and a non-parametric test actually have the same null hypothesis.
    • The parametric t-test is testing the mean of the distribution, assuming the first two moments exist.
    • The Wilcoxon rank sum test does not assume any moments, and tests equality of distributions instead.
    • The two tests are testing different hypotheses (comparable in a limited sense but different).
  • At large sample sizes, either of the parametric or the nonparametric tests work adequately.

33.3 Nonparametric Methods

Definition 33.2 Distribution-free methods are the Statistical methods that make no assumption about the probability distribution of the population.
Definition 33.3 Nonparametric methods are the statistical methods that require no assumption about the form of the probability distribution of the population and are often referred to as distribution free methods. Several of the methods can be applied with categorical as well as quantitative data.

Most of the statistical methods referred to as parametric methods require quantitative data, while nonparametric methods allow inferences based on either categorical or quantitative data.

  • However, the computations used in the nonparametric methods are generally done with categorical data.
    • Nominal or ordinal measures in many cases require a nonparametric test.
  • Whenever the data are quantitative, we will transform the data into categorical data in order to conduct the nonparametric test.
  • Most nonparametric tests use some way of ranking the measurements.
  • Nonparametric tests are used in cases where parametric tests are not appropriate.
    • Nonparametric tests are often necessary, especially when the distribution is skewed or not normal, the distribution is not known, or the sample size is too small (<30) to assume a normal distribution.
    • Also, if there are extreme values or values that are clearly “out of range”, nonparametric tests should be used.

33.4 Summary

Validation


Quality Control


Index Numbers


34 Introduction to Data Mining

Definitions and Exercises are from the Book (Daniel T. Larose 2015)

34.1 Overview

  • “An Introduction to Data Mining and Predictive Analytics”
Definition 34.1 Data mining is the process of discovering useful patterns and trends in large data sets.
Definition 34.2 Predictive analytics is the process of extracting information from large data sets in order to make predictions and estimates about future outcomes.
  • The Cross-Industry Standard Process for Data Mining (CRISP-DM) (Iterative)
    • Business/Research Understanding Phase
      • Clearly enunciate the project objectives and requirements in terms of the business or research unit as a whole.
      • Translate these goals and restrictions into the formulation of a data mining problem definition.
      • Prepare a preliminary strategy for achieving these objectives.
    • Data Understanding Phase
      • Collect the data.
      • Use Exploratory Data Analysis (EDA) to familiarize yourself with the data, and discover initial insights.
      • Evaluate the quality of the data.
      • If desired, select interesting subsets that may contain actionable patterns.
    • Data Preparation Phase
      • Raw | Select | Filter | Subset | Clean | Transform
    • Modeling Phase
      • Select and apply appropriate modeling techniques
      • Calibrate
    • Evaluation Phase
      • Models must be evaluated for quality and effectiveness.
      • Also, determine whether the model in fact achieves the objectives set for it
      • Establish whether some important facet of the business or research problem has not been sufficiently accounted for.
    • Deployment Phase / Report / Publish
  • Most common data mining tasks
    • Descriptions of patterns and trends often suggest possible explanations for the existence of those patterns within the data.
    • In estimation, we approximate the value of a numeric target variable using a set of numeric and/or categorical predictor variables
      • Estimation Methods: Point Estimation, Confidence Interval Estimation, Simple Linear Regression, Correlation, Multiple Regression
    • Prediction is similar to classification and estimation, except that for prediction, the results lie in the future.
      • Estimation Methods given above can also be used for prediction
      • Methods: k-nearest neighbor methods, decision trees, and neural networks.
    • Classification is similar to estimation, except that the target variable is categorical rather than numeric.
    • Clustering refers to the grouping of records, observations, or cases into classes of similar objects.
      • A cluster is a collection of records that are similar to one another, and dissimilar to records in other clusters.
      • Clustering differs from classification in that there is no target variable for clustering. The clustering task does not try to classify, estimate, or predict the value of a target variable. Instead, clustering algorithms seek to segment the whole data set into relatively homogeneous subgroups or clusters, where the similarity of the records within the cluster is maximized, and the similarity to records outside of this cluster is minimized.
      • Example: All zip codes of a country can be described in terms of distinct lifestyle types (e.g. 66 types). One of them, “Upper Crust,” might be the wealthiest lifestyle in the country, i.e. couples between the ages of 45 and 64 with no dependents (children, parents) living at home. This segment might have median earnings of ~1 million and might hold postgraduate degrees. No other lifestyle type would have a more opulent standard of living.
      • Methods: Hierarchical and k-means clustering, Kohonen networks, BIRCH clustering
    • Association seeks to uncover rules for quantifying the relationship between two or more attributes.
      • Also known as: affinity analysis or market basket analysis
      • Association rules are of the form “If antecedent then consequent,” together with a measure of the support and confidence associated with the rule.
      • Methods: Apriori algorithm, GRI algorithm
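The support and confidence measures mentioned above can be computed by hand for a single rule. The sketch below uses a tiny invented basket list (the item names and counts are illustrative only), evaluating the rule “If bread then butter.”

```r
# #Hedged sketch: support and confidence for the rule "If bread then butter"
# #Toy transactions, invented for illustration
baskets <- list(c("bread", "butter"), c("bread", "milk"),
                c("bread", "butter", "milk"), c("milk"), c("bread", "butter"))
n <- length(baskets)
#
has_A  <- sapply(baskets, function(b) "bread" %in% b)                  #Antecedent
has_AB <- sapply(baskets, function(b) all(c("bread", "butter") %in% b)) #Both
#
support    <- sum(has_AB) / n         #P(antecedent AND consequent)
confidence <- sum(has_AB) / sum(has_A) #P(consequent | antecedent)
support     ## 0.6
confidence  ## 0.75
```

Algorithms such as Apriori search the space of all such rules efficiently, but each rule is scored exactly this way.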

Validation


35 Data Processing

35.1 Data

Refer Lecture: Data Pre-Processing Refer Numerical Measures

Please import the "B18-Churn.xlsx". Please import the "B16-Cars2.csv".

  • Caution: The Cars2 dataset has 263 observations, whereas the Cars dataset has 261.

35.2 Flag Variables

Definition 35.1 A flag variable (or dummy variable, or indicator variable) is a categorical variable taking only two values, 0 and 1. Ex: Gender (Male, Female) can be recoded into dummy Gender (Male = 0, Female = 1).
  • When a categorical predictor takes \(k \geq 3\) possible values, then define \(k - 1\) dummy variables, and use the unassigned category as the reference category.
    • Ex: Region = North, East, South, West i.e. k = 4
      • 3 flags : flag_north, flag_east, flag_south; each will be 1 for their own region; 0 otherwise
      • The flag variable for the west is not needed, as ‘region = west’ is already uniquely identified by zero values for each of the three existing flag variables.
      • Instead, the unassigned category becomes the reference category, meaning that, the interpretation of the value of north_flag is ‘region = north’ compared to ‘region = west.’
        • For example, if we are running a regression analysis with income as the target variable, and the regression coefficient for north_flag equals 1000, then the estimated income for ‘region=north’ is 1000 greater than for ‘region=west,’ when all other predictors are held constant.
      • Further, inclusion of the fourth flag variable will cause some algorithms to fail, because \(X^{T}X\) becomes singular and \((X^{T}X)^{-1}\) does not exist in regression, for instance.
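The k - 1 dummy scheme above can be produced automatically in base R with model.matrix(), which drops the reference level of a factor (fastDummies::dummy_cols(), from the attached packages, is an alternative). The region values below are illustrative.

```r
# #Hedged sketch: k - 1 dummy variables with 'west' as the reference category
region <- factor(c("north", "east", "south", "west", "north"),
                 levels = c("west", "north", "east", "south"))  #First level = reference
#
# #model.matrix() encodes only k - 1 levels; the reference (west) is all zeros
mm <- model.matrix(~ region)
colnames(mm)  ## "(Intercept)" "regionnorth" "regioneast" "regionsouth"
mm[4, ]       #The 'west' record: all three flags are 0
```

Putting the reference level first in levels = controls which category the regression coefficients are compared against.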

35.3 Transforming Categorical to Numerical

  • Ex: Region = North, East, South, West can be assigned numbers 1 to 4
    • This may result in an algorithm treating them as continuous and/or ordered values
  • Generally, categorical variables should not be transformed into numerical ones, except when they are clearly ordered.
    • e.g. Survey Response is an ordered categorical variable and can be assigned values 1 to 5
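For the ordered case, base R makes the mapping explicit with an ordered factor; as.integer() then returns the level positions. The responses below are invented for illustration.

```r
# #Hedged sketch: an ordered categorical (Likert) response mapped to 1..5
resp <- c("Agree", "Neutral", "Strongly Agree", "Disagree")
lvls <- c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")
#
# #ordered = TRUE records that the levels have a meaningful order
resp_ord <- factor(resp, levels = lvls, ordered = TRUE)
as.integer(resp_ord)  ## 4 3 5 2
```

An unordered factor (e.g. Region) converted the same way would impose a spurious order, which is exactly the pitfall noted above.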

35.4 Binning Continuous to Categorical

  • Ex: Income into low, medium, high
    • Equal width binning divides the numerical predictor into k categories of equal width
      • NOT recommended for most applications because it can be greatly affected by the outliers
    • Equal frequency binning divides the numerical predictor into k categories, each having n/k records, where n is the total number of records.
      • Simple. However, it has the problem that sometimes the same value can be found in two consecutive groups.
      • Equal data values must belong to the same category.
    • Binning by clustering uses a clustering algorithm, such as k-means clustering to automatically calculate the “optimal” partitioning.
    • Binning based on predictive value: Above methods ignore the target variable; binning based on predictive value partitions the numerical predictor based on the effect each partition has on the value of the target variable.
ERROR 35.1 Error: Insufficient data values to produce ... bins.
  • summary(cut_number(diamonds$depth, n = 27)) Passed
  • summary(cut_number(diamonds$depth, n = 28)) Failed (N = 53940)
  • ggplot2::cut_number() has some internal logic about the size of the bins. It is not as simple as total size; it also depends on relative size.
  • It also fails when Bins have overlap.
  • Use dplyr::ntile()
  • OR Pick a bin size that works for your data.
# #n = 12
bb <- c(1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44)
#
# #Fixing Number of Bins: Unequal number of Observations and also Bins with 0 Observations
summary(cut(bb, breaks = 3))
## (0.957,15.3]  (15.3,29.7]    (29.7,44] 
##           11            0            1
summary(cut_interval(bb, n = 3))
##    [1,15.3] (15.3,29.7]   (29.7,44] 
##          11           0           1
#
# #For reference, NOT equivalent to above. 
summary(cut_width(bb, width = 15))
##  [-7.5,7.5]  (7.5,22.5] (22.5,37.5] (37.5,52.5] 
##           7           4           0           1
#
# #Using Equal Frequency: Same Observation may belong to different consecutive Bins
if(FALSE) ggplot2::cut_number(bb, n = 3) #ERROR
if(FALSE) { #Works
  #ceiling(seq_along(bb)/4)[rank(bb, ties.method = "first")] 
  tibble(bb = bb, RANK = rank(bb, ties.method = "first"), 
         ALONG = seq_along(bb), CEIL = ceiling(seq_along(bb)/4)) 
}
#
# #dplyr::ntile() can be used in place of ggplot2::cut_number()
dplyr::ntile(bb, n = 3)
##  [1] 1 1 1 1 2 2 2 2 3 3 3 3
#

35.5 Adding an Index Field (ID)

Caution: ID fields should be filtered out of the data mining algorithms, but should not be removed from the data. They are for easy identification of records, not for correlation.

mtcars %>% mutate(ID = row_number()) %>% relocate(ID) %>% slice(1:6L)
##                   ID  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4          1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag      2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710         3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive     4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout  5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant            6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

35.6 Variables that should not be removed (probably)

  • Ex: An example of correlated variables may be precipitation and attendance at a state beach.
    • As precipitation increases, attendance at the beach tends to decrease, so that the variables are negatively correlated.
  • Inclusion of correlated variables may at best double-count a particular aspect of the analysis, and at worst lead to instability of the model results.
  • Thus, we may decide to simply remove one of the variables.
  • However, it should not be done, as important information may thereby be discarded.
  • Instead, it is suggested that PCA be applied, where the common variability in correlated predictors may be translated into a set of uncorrelated principal components.
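The PCA suggestion can be sketched with prcomp() on simulated data (the two predictors below are invented and deliberately correlated): the resulting components are uncorrelated, so both variables' information is retained without the instability.

```r
# #Hedged sketch: replacing correlated predictors with uncorrelated components
# #Simulated data for illustration only
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.3)   #Strongly correlated with x1
#
pca <- prcomp(cbind(x1, x2), scale. = TRUE)
round(cor(pca$x), 3)              #Off-diagonal ~0: components are uncorrelated
summary(pca)$importance["Proportion of Variance", ]  #Shared variability loads on PC1
```

In practice, the first few components would replace the correlated predictors in the model.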

Validation


36 EDA

36.1 Overview

36.2 Churn

Please import the "B18-Churn.xlsx".

  • Churn (attrition)
    • It is a term used to indicate a customer leaving the service of one company in favor of another company.
    • The data set contains 20 predictors’ worth of information about 3333 customers, along with the target variable, churn, an indication of whether that customer churned (left the company) or not.
  • Description [3333, 21]
    • State: Categorical, for the 50 states and the District of Columbia.
    • Account length: Integer-valued, how long account has been active.
    • Area code: Categorical
    • Phone number: Essentially a surrogate for customer ID.
    • International plan: Dichotomous categorical, yes or no.
    • Voice mail plan: Dichotomous categorical, yes or no.
    • Number of voice mail messages: Integer-valued.
    • Total day minutes: Continuous, minutes customer used service during the day.
    • Total day calls: Integer-valued.
    • Total day charge: Continuous, perhaps based on above two variables.
    • Total eve minutes: Continuous, minutes customer used service during the evening.
    • Total eve calls: Integer-valued.
    • Total eve charge: Continuous, perhaps based on above two variables.
    • Total night minutes: Continuous, minutes customer used service during the night.
    • Total night calls: Integer-valued.
    • Total night charge: Continuous, perhaps based on above two variables.
    • Total international minutes: Continuous, minutes customer used service to make international calls.
    • Total international calls: Integer-valued.
    • Total international charge: Continuous, perhaps based on above two variables.
    • Number of calls to customer service: Integer-valued.
    • Churn: Target. Indicator of whether the customer has left the company (True or False)

36.3 Basics, Skewness & Normality

Table 36.1: (C33T01) Churn: Normality
Key Min Max SD Mean Median Mode Unique isNA Skewness p_Shapiro isNormal
day_mins 0 350.8 54.467 179.78 179.4 154 1667 0 -0.0291 0.6401 TRUE
day_calls 0 165 20.069 100.44 101 102 119 0 -0.1117 0.0003 FALSE
day_charge 0 59.6 9.259 30.56 30.5 26.18 1667 0 -0.0291 0.6401 TRUE
eve_mins 0 363.7 50.714 200.98 201.4 169.9 1611 0 -0.0239 0.7125 TRUE
eve_calls 0 170 19.923 100.11 100 105 123 0 -0.0555 0.0088 FALSE
eve_charge 0 30.9 4.311 17.08 17.12 16.12 1440 0 -0.0238 0.7091 TRUE
night_mins 23.2 395 50.574 200.87 201.2 188.2 1591 0 0.0089 0.627 TRUE
night_calls 33 175 19.569 100.11 100 105 120 0 0.0325 0.2514 TRUE
night_charge 1.04 17.8 2.276 9.04 9.05 9.66 933 0 0.0089 0.6238 TRUE
intl_mins 0 20 2.792 10.24 10.3 10 162 0 -0.2449 0 FALSE
intl_calls 0 20 2.461 4.48 4 3 21 0 1.3203 0 FALSE
intl_charge 0 5.4 0.754 2.76 2.78 2.7 162 0 -0.2451 0 FALSE
custserv_calls 0 9 1.315 1.56 1 1 10 0 1.0904 0 FALSE
vmail_message 0 51 13.688 8.1 0 0 46 0 1.2637 0 FALSE
account_length 1 243 39.822 101.06 101 105 212 0 0.0965 0.0012 FALSE
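One row of a summary like Table 36.1 can be reproduced with base R alone. The sketch below uses simulated data standing in for a numeric column such as day_mins (the mean and SD are chosen to resemble that column, purely for illustration), with a moment-based skewness formula in place of a package function.

```r
# #Hedged sketch: skewness + Shapiro-Wilk normality check for one numeric column
# #'x' is simulated stand-in data, not the actual day_mins column
set.seed(7)
x <- rnorm(500, mean = 180, sd = 54)
#
skew      <- mean((x - mean(x))^3) / sd(x)^3  #Moment-based skewness (~0 if symmetric)
p_shapiro <- shapiro.test(x)$p.value          #H0: data come from a normal distribution
isNormal  <- p_shapiro > 0.05                 #Fail to reject H0 => treat as normal
#
c(Skewness = round(skew, 4), p_Shapiro = round(p_shapiro, 4), isNormal = isNormal)
```

Note that shapiro.test() accepts at most 5000 observations, which is why a sample (or a different test) is needed for larger data sets.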

36.4 Explore Categorical Variables

36.4.1 International Plan

Contingency Table (CrossTab)

  • 28.4% of churners belong to the International Plan, compared to 6.5% of non-churners.
  • 42.4% of the International Plan holders churned, as compared to only 11.5% of those without the International Plan.
  • The proportion of International Plan holders is greater among the churners.
  • Summary
    • We should investigate what it is about our International Plan that is inducing our customers to leave
    • We should expect that, the model (in future) will probably include whether or not the customer selected the International Plan.
Table 36.2: (C33T02) International vs. Churn
Churn \(\rightarrow\) / \(\downarrow\) International | No | Yes | Row SUM | Row No % | Row Yes % | Col No % | Col Yes % | Col Row SUM %
No 2664 346 3010 88.5% 11.5% 93.5% 71.6% 90.3%
Yes 186 137 323 57.6% 42.4% 6.5% 28.4% 9.7%
Total 2850 483 3333 85.5% 14.5% 100% 100% 100%
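The row and column percentages in Table 36.2 can also be obtained directly with base R's prop.table(); the sketch below hard-codes the four cell counts from the table above.

```r
# #Hedged sketch: CrossTab percentages with base R, counts taken from Table 36.2
tab <- matrix(c(2664, 186, 346, 137), nrow = 2,
              dimnames = list(International = c("No", "Yes"),
                              Churn = c("No", "Yes")))
#
# #Row %: churn rate within each plan group (Yes row: 57.6, 42.4)
round(100 * prop.table(tab, margin = 1), 1)
#
# #Col %: plan mix within non-churners vs. churners (Yes col: 71.6, 28.4)
round(100 * prop.table(tab, margin = 2), 1)
```

This reproduces the key reading of the table: 42.4% of International Plan holders churned, and 28.4% of churners held the plan.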

Bar Charts

  • Bar Charts
    • Grouped
      • These are good for comparing between each element in the categories, and comparing elements across categories.
    • Stacked
      • They are great for showing the total because they visually aggregate all of the categories in a group.
      • The downside is that it becomes harder to compare the sizes of the individual categories.
      • Stacking also indicates a part to whole relationship.
    • Stacked Percent
      • The total quantity is hidden by using percentages, but it is easier to see the relative difference between quantities in each group.
    • Summary
      • If there is no part to whole relationship (maybe there is overlap in the categories), then grouped
      • If there is a part to whole relationship, then the next question to ask is what relationship is the most important to show.
        • If the goal is to show sizes between individual categories, use a grouped column or bar chart.
        • If the goal is to show the total sizes of groups, use a regular stacked bar chart.
        • If the goal is to show relative differences within each group, use a stacked percentage column chart.
ERROR 36.1 Error: stat_count() can only have an x or y aesthetic.
ERROR 36.2 Error: stat_count() must not be used with a y aesthetic.
  • stat_count() Based on the data, choose stat
    • stat = ‘identity’ tells ggplot2 that you will provide the y-values (i.e. Wide Data)
    • stat = ‘count’ is the default, which implies that ggplot2 will count the aggregate number of rows for each x value. (i.e. Long Data)
Image

Figure 36.1 International Plan holders tend to churn more frequently

Code CrossTab
# #xyw: x (independent), y (dependent), c (column =y), r (row =x)
# #g (grouped), w (wider), l (longer), 
r_xyg <- "International"
c_xyg <- "Churn"
xyg <- bb %>% select(int_l_plan, churn) %>% rename(Predictor = 1, Target = 2) %>% 
  mutate(across(1:2, ~ifelse(., 'Yes', 'No'))) %>% 
  count(Predictor, Target) 
#
str(xyg)
# #IN: xyg (Predictor, Target), r_xyg, c_xyg 
# #Generate Contingency Table (CrossTab)
ctab <- xyg %>% 
  pivot_wider(names_from = Target, values_from = n, values_fill = 0, names_sort = TRUE) %>%
  #mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~ "Total")), summarise(., across(where(is.numeric), sum))) %>% 
  mutate(SUM = rowSums(across(where(is.numeric)))) %>% 
  mutate(across(where(is.numeric) & !matches("SUM"), 
                list(r = ~ round(. /SUM, 3)), .names = "xx_{.fn}{.col}")) %>% 
  mutate(across(where(is.numeric) & !starts_with("xx_") & !matches("SUM"), 
                list(rp = ~ paste0(round(100 * . /SUM, 1), "%")), .names = "{.fn}{.col}")) %>%
  mutate(across(where(is.numeric) & !starts_with("xx_"), 
                list(c = ~ 2 * ./sum(.)), .names = "yy_{.fn}{.col}")) %>% 
  mutate(across(where(is.numeric) & !starts_with("xx_") & !starts_with("yy_"), 
                list(cp = ~ paste0(round(100 * 2 * . /sum(.), 1), "%")), .names = "{.fn}{.col}"))
Code CrossTab Print
hh <- ctab %>% select(-c(5:6, 9:11)) 
#names_hh <- names(hh)
names_hh <-c(paste0(c_xyg, " ", "$\\rightarrow$", "<br/>", "$\\downarrow$", " ", r_xyg), 
                "No <br/> <br/>", "Yes <br/> <br/>", "Row <br/> SUM", 
                "Row <br/> No %", "Row <br/> Yes %", 
                "Col <br/> No %", "Col <br/> Yes %", "Col <br/> Row SUM %") 
stopifnot(identical(ncol(hh), length(names_hh)))
stopifnot(nrow(hh) < 10)
#
cap_hh <- paste0("(C33T02) ", r_xyg, " vs. ", c_xyg)
#
kbl(hh,
  caption = cap_hh,
  col.names = names_hh,
  escape = FALSE, align = "c", booktabs = TRUE
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                html_font = "Consolas", font_size = 12,
                full_width = FALSE,
                #position = "float_left",
                fixed_thead = TRUE
  ) %>%
# #Header Row Dark & Bold: RGB (48, 48, 48) =HEX (#303030)
    row_spec(0, color = "white", background = "#303030", bold = TRUE,
             extra_css = "border-bottom: 1px solid; border-top: 1px solid"
    ) #%>% row_spec(row = 1:nrow(hh), color = "black")

36.4.2 Voice Mail Plan

Contingency Table (CrossTab)

  • 16.7% of those without the Voice Mail Plan are churners, as compared to 8.7% of customers who do have the Voice Mail Plan.
    • Thus, customers without the Voice Mail Plan are nearly twice as likely to churn as customers with the plan.
  • Summary
    • Perhaps we should enhance our Voice Mail Plan still further, or make it easier for customers to join it, as an instrument for increasing customer loyalty
    • We should expect that, the model (in future) will probably include whether or not the customer selected the Voice Mail Plan. Our confidence in this expectation is perhaps not quite as high as for the International Plan
Table 36.3: (C33T03) VoiceMail vs. Churn
Churn \(\rightarrow\) / \(\downarrow\) VoiceMail | No | Yes | Row SUM | Row No % | Row Yes % | Col No % | Col Yes % | Col Row SUM %
No 2008 403 2411 83.3% 16.7% 70.5% 83.4% 72.3%
Yes 842 80 922 91.3% 8.7% 29.5% 16.6% 27.7%
Total 2850 483 3333 85.5% 14.5% 100% 100% 100%

36.4.3 Three Categorical Variables

  • Two-way interactions among categorical variables with respect to churn.
    • Multilayer clustered bar chart of churn, clustered by both International Plan and Voice Mail Plan.
      • There are many more customers who have neither plan.
      • There are many more customers who have the voice mail plan only than have both plans.
      • More importantly, among customers with no voice mail plan, the proportion of churners is greater for those who do have an international plan (43.7%) than for those who do not (13.9%).
      • Likewise, among customers with the voice mail plan, the proportion of churners is much greater for those who also select the international plan (39.1%) than for those who do not (5.3%).
      • Note also that there is no interaction among the categorical variables. That is, international plan holders have greater churn regardless of whether they are Voice Mail plan adopters or not.

Image


Figure 36.2 Churn vs. International & Voice Mail Plans (Both are Same Graphs)


Figure 36.3 Churn vs. International & Voice Mail Plans (Proportional Stack)

Code

xsyg <- bb %>% select(int_l_plan, vmail_plan, churn) %>% 
  rename(Intl = 1, Vmail =2, Churn = 3) %>% 
  mutate(across(Churn, ~ifelse(., "Yes", "No"))) %>% 
  mutate(across(Intl, ~ifelse(., "I: Yes", "I: No"))) %>% 
  mutate(across(Vmail, ~ifelse(., "V: Yes", "V: No"))) %>% 
  count(Intl, Vmail, Churn) %>% rename(N = n)
# #Multilayer Clustered Bar Chart
hh <- xsyg
#
cap_hh <- "C33P04"
ttl_hh <- "Churn: Churn vs. International & Voice Mail Plans"
sub_hh <- NULL 
x_hh <- "Churn" #r_xyg
y_hh <- "Frequency"
lgd_hh  <- "Churn" #c_xyg
#
C33 <- hh %>% { ggplot(., aes(x = Churn, y = N, fill = Churn)) + 
    geom_bar(position = "dodge", stat = "identity", alpha = 1) + 
    geom_text(position = position_stack(vjust = 0.5), aes(label = N)) +
    facet_wrap(Intl ~ Vmail, nrow = 1) +
    scale_fill_manual(values = c('#FFEA46FF', '#787877FF')) +
    theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
          panel.border = element_rect(colour = "black", fill = NA, size = 1),
          legend.position = 'top', 
          legend.box = "horizontal", legend.direction = "horizontal") +
    labs(x = x_hh, y = y_hh, fill = lgd_hh, 
         subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #Multilayer Clustered Bar Chart
hh <- xsyg
#
cap_hh <- "C33P05"
ttl_hh <- "Churn: Churn vs. International & Voice Mail Plans"
sub_hh <- NULL 
x_hh <- NULL #"Churn" #r_xyg
y_hh <- "Frequency"
lgd_hh  <- "Churn" #c_xyg
#
C33 <- hh %>% { ggplot(., aes(x = Vmail, y = N, fill = Churn)) + 
    geom_bar(position = "dodge", stat = "identity", alpha = 1) + 
    geom_text(position = position_dodge(width = 1), aes(label = N), vjust = 1.5) +
    facet_wrap(~Intl, nrow = 1) +
    scale_fill_manual(values = c('#FFEA46FF', '#787877FF')) +
    theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
          panel.border = element_rect(colour = "black", fill = NA, size = 1),
          legend.position = 'top', 
          legend.box = "horizontal", legend.direction = "horizontal") +
    labs(x = NULL, y = y_hh, fill = lgd_hh, 
         subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

Code More

# #Percent Stacked Bar Chart
hh <- xsyg %>% group_by(Vmail, Intl) %>% 
    mutate(Ratio = paste0(round(100 * N/sum(N), 1), "%")) %>% ungroup()
#
cap_hh <- "C33P06"
ttl_hh <- "Churn: Churn vs. International & Voice Mail Plans"
sub_hh <- NULL 
x_hh <- "Voice Mail"
y_hh <- "Grouped Percentage"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(., aes(x = Vmail, y = N, fill = Churn)) + 
        geom_bar(position = "fill", stat = 'identity') + 
        facet_wrap(~Intl, nrow = 1) +
        geom_text(position = position_fill(vjust = 0.5), aes(label = Ratio), 
              colour = rep(c("black", "white"), 4)) +
        scale_fill_viridis_d(direction = -1) +  
        scale_y_continuous(labels = percent) +
        theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
              panel.border = element_rect(colour = "black", fill = NA, size = 1)) +
        labs(x = x_hh, y = y_hh, fill = lgd_hh, 
            subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

36.5 Exploring Numeric Variables

  • Refer figure 36.4 & Table 36.1
    • Fields not showing evidence of symmetry include voice mail messages and customer service calls.
    • Voice Mail Messages: Median = 0
      • Indicating that at least half of all customers had no voice mail messages.
      • Because fewer than half of the customers select the Voice Mail Plan (27.7%).
    • Customer Service Calls: Mean = 1.56 > Median = 1
      • It indicates some right-skewness
      • Also indicated by the maximum number of customer service calls (9).

Figure 36.4 Churn: All Histograms

36.5.1 Service Calls

  • Overlay Histogram
    • To explore whether a predictor is useful for predicting the target variable
    • Bars are colored according to the values of the target variable.
    • Normalised Histogram
      • Proportions stretched out to enable better contrast
        • Normalized histograms are useful for teasing out the relationship between a numerical predictor and the target. However, data analysts should always provide a companion non-normalized histogram along with the normalized one, because the normalized histogram provides no information on the frequency distribution of the variable.
        • Ex: The churn rate for customers logging nine service calls is 100%; but there are only two customers with this number of calls.
      • Customers who have called customer service three times or less have a markedly lower churn rate than customers who have called customer service four or more times.
  • Summary
    • By the third service call, specialized incentives should be offered to retain customer loyalty, because, by the fourth call, the probability of churn increases greatly
    • We should expect that, the model (in future) will probably include the number of customer service calls made by the customer.

Image


Figure 36.5 Service Calls beyond 3 have a significant increase in Churn


Figure 36.6 Churn vs. Service Call Proportions (Bar)

Code

ii <- bb %>% select(custserv_calls, churn) %>% 
  rename(Churn = 2) %>% 
  mutate(across(Churn, ~ifelse(., "Yes", "No")))
# #Histogram Default (Not useful for association of predictor and target)
hh <- ii
ttl_hh <- "Churn: Customer Service Calls"
cap_hh <- "C33P08"
sub_hh <- "Predictor Only" 
x_hh <- "x"
y_hh <- "Frequency"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = custserv_calls, fill = '#FDE725FF')) + 
    geom_histogram(bins = length(unique(.[[1]])), alpha = 1) + 
    #stat_bin(bins = length(unique(.[[1]])), aes(y=..count.., label=..count..), 
    #         geom="text", position=position_stack(vjust=0.5)) +
    scale_x_continuous(breaks = breaks_pretty()) + 
    scale_fill_viridis_d(direction = -1) +
    theme(plot.title.position = "panel", 
          axis.title.x = element_blank(), 
          #legend.position = c(0.5, -0.08), legend.direction = 'horizontal', 
          legend.position = 'none') +
    labs(x = x_hh, y = y_hh, #fill = lgd_hh,
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #Histogram (Predictor and Target Count)
hh <- ii
ttl_hh <- "Churn: Customer Service Calls & Churn"
cap_hh <- "C33P09"
sub_hh <- "Count of Predictor & Target" 
x_hh <- "x"
y_hh <- "Frequency"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = custserv_calls, fill = Churn)) + 
    geom_histogram(bins = length(unique(.[[1]])), alpha = 1) + 
    #stat_bin(bins = length(unique(.[[1]])), aes(y=..count.., label=..count..), 
    #         geom="text", position=position_stack(vjust=0.5)) +
    scale_x_continuous(breaks = breaks_pretty()) + 
    scale_fill_viridis_d(direction = -1) +
    theme(plot.title.position = "panel", 
          axis.title.x = element_blank(), 
          legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
    labs(x = x_hh, y = y_hh, fill = lgd_hh,
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #Histogram (Predictor and Target Proportion)
hh <- ii
ttl_hh <- "Churn: Customer Service Calls & Churn"
cap_hh <- "C33P10"
sub_hh <- "Proportion of Predictor & Target (Using Histogram)" 
x_hh <- "x"
y_hh <- "Grouped Percentage"
lgd_hh  <- "Churn"
#
# #NOTES: Group wise Proportion i.e. Yes 1, No 1 (Not Each Bin wise)
# #Adding Density automatically converts the y-axis to 'frequency density' not to 'percentage'
# #Both will match when Bin Width =1 but otherwise there will be a mismatch
# #Using y=..density.. scales the histograms so the area under each is 1, or sum(binwidth*y)=1. 
# #So use y = binwidth *..density.. to have y represent the fraction of the total in each bin. 
# #OR aes(y = stat(width*density))
# #OR aes(y = stat(count / sum(count)))
# #Clarification
# # ..count../sum(..count..) each count is divided by the total count
# # ..density.. it is applied to each group independently
#
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = custserv_calls, fill = Churn)) + 
    geom_histogram(bins = length(unique(.[[1]])), alpha = 1, position = 'fill') + 
    #stat_bin(bins = length(unique(.[[1]])), 
    #         aes(y = c(..count..[..group..==1]/sum(..count..[..group..==1]),
    #                   ..count..[..group..==2]/sum(..count..[..group..==2])), 
    #             label=round(..density.., 2)), geom="text") +
    #stat_bin(bins = length(unique(.[[1]])), 
    #         aes(y=..density.., group = ..group.., 
    #             label=round(..density.., 2)), geom="text") +
    scale_x_continuous(breaks = breaks_pretty()) + 
        #Deprecated : percent_format()
        scale_y_continuous(labels = percent) +
    scale_fill_viridis_d(direction = -1) +
    theme(plot.title.position = "panel", 
          axis.title.x = element_blank(), 
          legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
    labs(x = x_hh, y = y_hh, fill = lgd_hh,
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

Code More

# #complete() can be used to add the missing combination but not using it for now
# #NOTE: Bar is easier to include Labels, However Histogram is easier for large number of bins
xyg  <- ii %>% count(custserv_calls, Churn) 
     #%>% complete(custserv_calls, Churn, fill = list(n = 0)) 
hh <- xyg %>% rename(Group = 1, SubGroup = 2, N = 3) %>% 
  group_by(Group) %>% 
  mutate(Ratio = paste0(round(100 * N/sum(N), 1), "%")) %>% ungroup()
#
ttl_hh <- "Churn: Customer Service Calls & Churn"
cap_hh <- "C33P11"
sub_hh <- "Proportion of Predictor & Target (Using Bar)" 
x_hh <- "x"
y_hh <- "Grouped Percentage"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(., aes(x = Group, y = N, fill = SubGroup)) + 
    geom_bar(position = "fill", stat = 'identity') + 
    geom_text(position = position_fill(vjust = 0.5), 
              aes(label = ifelse(SubGroup == 'No', "", Ratio)), 
              colour = c(rep(c("black", "white"), 9), "white")) +
    scale_fill_viridis_d(direction = -1) +
    scale_x_continuous(breaks = breaks_pretty()) +
    scale_y_continuous(labels = percent) +
    theme(plot.title.position = "panel", axis.title.x = element_blank(), 
          panel.grid.major.x = element_blank(), axis.line = element_blank(),
          panel.border = element_rect(colour = "black", fill = NA, size = 1),
          legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
    labs(x = x_hh, y = y_hh, fill = lgd_hh, 
         subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

36.5.2 Day Minutes

  • Summary: Day Minutes
    • High day-users tend to churn at a higher rate.
    • As the number of day minutes passes 200, we should consider special incentives
    • We should investigate why heavy day-users are tempted to leave
    • We should expect that, the model (in future) will probably include ‘day minutes.’

Figure 36.7 Higher Day Minutes (>200) have higher Churn

36.5.3 Evening Minutes

  • Summary : Evening Minutes
    • It shows a slight tendency for customers with higher evening minutes to churn.
    • However, it is inconclusive solely based on graph.

Figure 36.8 (Inconclusive) Slight tendency to Churn with higher Evening Minutes

36.5.4 Night Minutes

  • Summary : Night Minutes
    • There is no obvious association visible (in graphs) between churn and night minutes
    • The lack of obvious association at the EDA stage between a predictor and a target variable is not sufficient reason to omit that predictor from the model.
    • Unless there is a good reason for eliminating the variable before modeling, we should keep them and allow the modeling process to identify which variables are predictive and which are not.
No obvious association between Churn and Night Minutes

Figure 36.9 No obvious association between Churn and Night Minutes

36.5.5 International Calls

  • Summary : International Calls
    • There is no obvious association visible (in graphs) between churn and International Calls
    • However, the t-test for the difference in mean number of international calls between churners and non-churners is statistically significant, i.e., the means are different.
    • NOTE: Hypothesis Testing is NOT part of EDA. It is mentioned here to show that statistically significant association may exist without being obviously visible.

Image

No obvious association between Churn and International Calls (But t-test)

Figure 36.10 No obvious association between Churn and International Calls (But t-test)

t-test

# #t-test for difference in Mean of Target Values
str(ii)
## tibble [3,333 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Predictor: int [1:3333] 3 3 5 7 3 6 7 6 4 5 ...
##  $ Target   : chr [1:3333] "No" "No" "No" "No" ...
# 
# #Variance
var_ii <- var.test(formula = Predictor ~ Target, data = ii)
var_ii
## 
##  F test to compare two variances
## 
## data:  Predictor by Target
## F = 0.91594, num df = 2849, denom df = 482, p-value = 0.1971
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7962798 1.0465337
## sample estimates:
## ratio of variances 
##          0.9159437
#
isVarEqual <- ifelse(var_ii$p.value > 0.05, TRUE, FALSE)
if(isVarEqual) print("Variances are Equal.") else print("Variances are Different.")
## [1] "Variances are Equal."
#
# #t-test: Welch 
ha_ii <- "two.sided" #"two.sided", "less", "greater"
#tt_ii <- t.test(Predictor ~ Target, data = ii, alternative = ha_ii, var.equal = isVarEqual)
tt_ii <- t.test(Predictor ~ Target, data = ii, alternative = ha_ii, var.equal = FALSE)
tt_ii
## 
##  Welch Two Sample t-test
## 
## data:  Predictor by Target
## t = 2.9604, df = 640.64, p-value = 0.003186
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  0.1243807 0.6144620
## sample estimates:
##  mean in group No mean in group Yes 
##          4.532982          4.163561
#
alpha <- 0.05
# #NOTE: t.test() returns the p-value already matched to the chosen
# #alternative, so it is compared with alpha directly (not alpha / 2)
if(tt_ii$p.value >= alpha) {
    print("Failed to reject H0.")
} else { 
    print("H0 Rejected.")
}
## [1] "H0 Rejected."

36.5.6 All Histograms

Images

Histograms of All Predictors with Target (Count)

Figure 36.11 Histograms of All Predictors with Target (Count)

Histograms of All Predictors with Target (Proportion)

Figure 36.12 Histograms of All Predictors with Target (Proportion)

Code

# #All Continuous Predictors and Target (Categorical)
xsy <- bb %>% 
  select(where(is.numeric) | "churn") %>% 
  select(!area_code) %>% 
  relocate(ends_with("_mins")) %>% 
  relocate(ends_with("_calls")) %>% 
  relocate(vmail_message, .after =  last_col()) %>% 
  relocate("churn") %>% rename("Target" = 1) %>% 
  mutate(across(Target, ~ifelse(., "Yes", "No"))) %>% 
  mutate(across(Target, factor, levels = unique(Target)))
#
xsyl <- xsy %>% pivot_longer(where(is.numeric), names_to = "Predictors", values_to = "Values") %>% 
  mutate(across(Predictors, ~ factor(., levels = unique(Predictors))))
#
#str(ii)
# #Histogram
hh <- xsyl
ttl_hh <- "Churn: Histograms of All Predictors with Target (Count)"
cap_hh <- "C33P18"
sub_hh <- NULL #"Count of Predictor & Target" 
x_hh <- NULL # "x"
y_hh <- NULL # "Frequency" #"Grouped Percentage"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = Values, fill = Target)) + 
    geom_histogram(alpha = 1, boundary = 0, #position = position_stack(reverse = TRUE),
        bins = ifelse(length(unique(.$Predictors)) > 50, 50, length(unique(.$Predictors)))) + 
    #geom_histogram(alpha = 1, boundary = 0,
    #    bins = ifelse(nrow(distinct(.[2])) > 50, 50, nrow(distinct(.[2])))) + 
    facet_wrap(~Predictors, nrow = 3, scales = 'free') +
    scale_x_continuous(breaks = breaks_pretty()) + 
    scale_fill_viridis_d(direction = -1) +
    theme(plot.title.position = "panel", 
          strip.text.x = element_text(size = 10, colour = "white"), 
          axis.title.x = element_blank(), 
          legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
    labs(x = x_hh, y = y_hh, fill = lgd_hh, 
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #Histogram
hh <- xsyl
ttl_hh <- "Churn: Histograms of All Predictors with Target (Proportion)"
cap_hh <- "C33P19"
sub_hh <- NULL #"Count of Predictor & Target" 
x_hh <- NULL # "x"
y_hh <- NULL # "Frequency" #"Grouped Percentage"
lgd_hh  <- "Churn"
#
# #Caution: Warning on Removal of Missing Values has been removed
# #NOTE: position_fill() normalizes the Bars and is same as 'fill'
# # 
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = Values, fill = Target)) + 
    geom_histogram(alpha = 1, boundary = 0, position = position_fill(), na.rm = TRUE, 
        bins = ifelse(length(unique(.$Predictors)) > 50, 50, length(unique(.$Predictors)))) + 
    #geom_histogram(alpha = 1, boundary = 0,
    #    bins = ifelse(nrow(distinct(.[2])) > 50, 50, nrow(distinct(.[2])))) + 
    facet_wrap(~Predictors, nrow = 3, scales = 'free_x') +
    scale_x_continuous(breaks = breaks_pretty()) + 
    scale_fill_viridis_d(direction = -1) +
    theme(plot.title.position = "panel", 
          strip.text.x = element_text(size = 10, colour = "white"), 
          axis.title.x = element_blank(), 
          legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
    labs(x = x_hh, y = y_hh, fill = lgd_hh, 
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

36.6 Exploring Multivariate Variables : Scatter plots

36.6.1 Day minutes & Evenings minutes

  • Summary:
    • Customers with both high day minutes and high evening minutes appear to have a higher proportion of churners than the records below the line in the scatter plot.

Image

Scatterplot of Evening Minutes (X) and Day Minutes (Y) shows a clear separation for Churn at High X and High Y

Figure 36.13 Scatterplot of Evening Minutes (X) and Day Minutes (Y) shows a clear separation for Churn at High X and High Y

Code

ii <- bb %>% select(day_mins, eve_mins, churn)  %>% 
  rename(Target = churn) %>% 
  mutate(across(Target, ~ifelse(., "Yes", "No")))
hh <- ii
ttl_hh <- "Churn: Scatterplot of Evening Minutes and Day Minutes"
cap_hh <- "C33P22"
sub_hh <- NULL #subtitle = TeX(r"(Trendline Equation, $R^{2}$, $\bar{x}$ and $\bar{y}$)")
x_hh <- "Evening Minutes" # "x"
y_hh <- "Day Minutes" # "Frequency" #"Grouped Percentage"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., aes(x = eve_mins, y = day_mins)) + 
    #geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
    #geom_point(position = "jitter", aes(colour = Target)) +
    geom_jitter(aes(colour = Target)) +
        scale_colour_viridis_d(alpha = 0.9, direction = -1) +
    theme(panel.grid.minor = element_blank(),
          panel.border = element_blank()) +
    labs(x = x_hh, y = y_hh, colour = lgd_hh,
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

36.6.2 Day minutes & Customer Service Calls

  • Summary:
    • It indicates a high-churn area in the upper left section of the graph. These records represent customers who have a combination of a high number of customer service calls and a low number of day minutes used.
    • Note that this group of customers could not have been identified had we restricted ourselves to univariate exploration (exploring one variable at a time). This is because of the interaction between the variables.
    • In general, customers with higher numbers of customer service calls tend to churn at a higher rate, as we learned earlier in the univariate analysis. However, of these customers with high numbers of customer service calls, those who also have high day minutes are somewhat “protected” from this high churn rate. The customers in the upper right of the scatter plot exhibit a lower churn rate than those in the upper left.

Image

Scatterplot of Day Minutes (X) and Customer Service Calls (Y) shows an interaction effect for Churn

Figure 36.14 Scatterplot of Day Minutes (X) and Customer Service Calls (Y) shows an interaction effect for Churn

Code

ii <- bb %>% select(day_mins, custserv_calls, churn)  %>% 
  rename(Target = churn) %>% 
  mutate(across(Target, ~ifelse(., "Yes", "No")))
hh <- ii
ttl_hh <- "Churn: Day Minutes and Customer Service Calls"
cap_hh <- "C33P27"
sub_hh <- NULL
x_hh <- "Day Minutes"
y_hh <- "Customer Service Calls"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., aes(x = day_mins, y = custserv_calls)) + 
    geom_jitter(aes(colour = Target), width = 0.1, height = 0.1) +
    scale_colour_viridis_d(alpha = 0.9, direction = -1) +
    scale_y_continuous(breaks = breaks_pretty()) + 
    theme(panel.grid.minor = element_blank(),
          panel.border = element_blank()) +
    labs(x = x_hh, y = y_hh, colour = lgd_hh,
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

36.6.3 package:GGally

SPLOM

  • Scatter Plot of Matrices (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal.
Bivariate Analysis of Calls showing Churn

Figure 36.15 Bivariate Analysis of Calls showing Churn

Bivariate Analysis of Minutes & Charges showing Churn

Figure 36.16 Bivariate Analysis of Minutes & Charges showing Churn

GGally Bivariate

# #Assumes Column 1 has Target Variable (Factor)
C33 <- hh %>% { 
  ggpairs(data = ., mapping = aes(colour = Target, fill = Target, alpha = 0.3), 
          columns = 2:ncol(.), 
          lower = list(continuous = f_gg_scatter),
          diag = list(continuous = f_gg_density)) +
    labs(caption = cap_hh, subtitle = sub_hh, title = ttl_hh) 
}
assign(cap_hh, C33)
rm(C33)

GGally Manual

# #For GGally Manual Functions
f_gg_scatter <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_jitter(...) +
    scale_colour_viridis_d(direction = -1)
}

f_gg_density  <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_density(...) +
    scale_fill_viridis_d(direction = -1) + 
      scale_colour_viridis_d(direction = -1) 
}

36.6.4 package:psych

SPLOM

Bivariate Analysis of Calls showing Churn

Figure 36.17 Bivariate Analysis of Calls showing Churn

Bivariate Analysis of Minutes & Charges showing Churn

Figure 36.18 Bivariate Analysis of Minutes & Charges showing Churn

pairs.panels()

# #IN: hh, cap_hh, ttl_hh, loc_png
if(!file.exists(loc_png)) {
  png(filename = loc_png, width = k_width, height = k_height, units = "in", res = 144) 
  pairs.panels(hh[2:ncol(hh)], smooth = FALSE, jiggle = TRUE, rug = FALSE, ellipses = FALSE, 
               bg = rev(viridis(2))[hh$Target], pch = 21, lwd = 1, cex.cor = 1, cex = 1, 
               gap = 0, main = ttl_hh)
  #title(main = ttl_hh, line = 2, adj = 0)
  title(sub = cap_hh, line = 4, adj = 1)
  C33 <- recordPlot()
  dev.off()
  assign(cap_hh, C33)
  rm(C33)
}

36.7 Subset Interesting Datapoints

  • Those which were identified by EDA
  • Anomalous fields like Area Code, which has only 3 values yet is distributed over all the states

36.8 Binning

  • Figure 36.6 showed that customers with less than four calls to customer service had a lower churn rate than customers who had four or more calls to customer service.
    • We may therefore decide to bin the customer service calls variable into two classes, low (fewer than four) and high (four or more).
High Customer Service Calls have major Churn

Figure 36.19 High Customer Service Calls have major Churn

Table 36.4: (C33T04) Service Calls vs. Churn
Churn \(\rightarrow\)          No    Yes   Row SUM   Row No %   Row Yes %   Col No %   Col Yes %   Col Row SUM %
\(\downarrow\) Service Calls
Low                          2721    345      3066      88.7%       11.3%      95.5%       71.4%           92%
High                          129    138       267      48.3%       51.7%       4.5%       28.6%            8%
Total                        2850    483      3333      85.5%       14.5%       100%        100%           100%
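This two-class binning can be sketched with base R's cut(); the custserv_calls vector below is hypothetical, standing in for the real column:

```r
# #Bin Customer Service Calls: Low (fewer than four) vs. High (four or more)
custserv_calls <- c(1, 0, 2, 3, 4, 4, 5, 1, 9, 0, 3, 6)  # hypothetical values

svc_bin <- cut(custserv_calls,
               breaks = c(-Inf, 3, Inf),   # (-Inf, 3] -> Low, (3, Inf) -> High
               labels = c("Low", "High"))
table(svc_bin)
```

The binned factor can then be cross-tabulated against churn exactly as in Table 36.4.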
  • Similarly, Evening Minutes can be given flags (Categorical)
    • Low (<= 160), Medium (160, 240], High (> 240)
    • Recall that the baseline churn rate for all customers is 14.49%. The medium group comes in very close to this baseline rate, 14.1%.
    • However, the High evening minutes group has nearly double the churn proportion (19.5%) compared to the low evening minutes group (10%).
    • The chi-square test is significant, meaning that these results are most likely real and not due to chance alone. “ForLater”
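A minimal sketch of the three-way flag and the chi-square test; both the eve_mins vector and the contingency table below are made up (the table's churn rates of roughly 10%, 14.1%, and 19.5% mirror the ones quoted above):

```r
# #Flag Evening Minutes: Low (<= 160), Medium (160, 240], High (> 240)
eve_mins <- c(120, 150, 161, 200, 240, 241, 300, 90)   # hypothetical values
eve_flag <- cut(eve_mins, breaks = c(-Inf, 160, 240, Inf),
                labels = c("Low", "Medium", "High"))
table(eve_flag)

# #Chi-square test of independence on a hypothetical Flag x Churn table
tab <- matrix(c(900,  100,    # Low:    No, Yes (~10% churn)
                1200, 197,    # Medium: No, Yes (~14.1% churn)
                750,  182),   # High:   No, Yes (~19.5% churn)
              ncol = 2, byrow = TRUE,
              dimnames = list(c("Low", "Medium", "High"), c("No", "Yes")))
chisq.test(tab)   # a small p-value suggests the rates differ beyond chance
```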

36.9 Correlation

  • The threshold for significance of the correlation coefficient \(r\) depends on the sample size. In data mining, where there are a large number of records (over 1000), even small values of \(r\), such as \(-0.1 \leq r \leq 0.1\), may be statistically significant.
    • One should take care to avoid feeding correlated variables to data mining and statistical models.
    • At best, using correlated variables will overemphasize one data component; at worst, using correlated variables will cause the model to become unstable and deliver unreliable results.
    • However, just because two variables are correlated does not mean that we should omit one of them.
      • Identify any variables that are perfectly correlated (i.e., r = 1.0 or r = −1.0). Do not retain both variables in the model, but rather omit one.
      • Identify groups of variables that are correlated with each other. Then, later, during the modeling phase, apply dimension-reduction methods, such as PCA, to these variables.
    • Example: Charge is perfectly correlated with Minutes (for each of Day, Evening, Night, International). Thus each Charge field can be eliminated, reducing the number of predictors from 20 to 16.
    • The correlation coefficient of 0.038 between account length and day calls has a small p-value of 0.026, telling us that account length and day calls are (weakly) positively correlated. We should note this and prepare to apply PCA during the modeling phase.
    • “ForLater” - Get Pearson Correlation Coefficient
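The account-length/day-calls point can be reproduced in spirit with cor.test(): with many records, even a tiny r earns a small p-value. The data below are simulated, not taken from the churn set:

```r
# #With a large n, a small correlation can still be statistically significant
set.seed(9)
n <- 10000
x <- rnorm(n)
y <- 0.05 * x + rnorm(n)         # true correlation of roughly 0.05

ct <- cor.test(x, y)
round(unname(ct$estimate), 3)    # a small r ...
ct$p.value                       # ... yet typically p < 0.05 at this n
```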

All Continuous Variables

# #All Continuous Predictors and Target (Categorical)
# #Select Relevant | Binning | Rename & Relocate | Relabel & Factor Target|
xsy <- bb %>% select( (where(is.numeric) & ! "area_code") | "churn") %>% 
  relocate(ends_with("_mins")) %>% 
  relocate(ends_with("_calls")) %>% 
  relocate(vmail_message, .after =  last_col()) %>% 
  relocate(Target = "churn") %>% 
  mutate(across(Target, ~ifelse(., "Yes", "No"))) %>% 
  mutate(across(Target, factor, levels = unique(Target)))
# #
xsyl <- xsy %>% pivot_longer(where(is.numeric), names_to = "Predictors", values_to = "Values") %>% 
  mutate(across(Predictors, ~ factor(., levels = unique(Predictors))))
# #
# #Select All Minutes and All Charges (8)
hh <- xsy %>% select(1, 7:10, 12:15) 
#
ttl_hh <- "GGally: Churn: SPLOM: Minutes and Charges (8)"
cap_hh <- "C33P31"
sub_hh <- NULL #"Count of Predictor & Target" 
lgd_hh  <- "Churn"

GGally

# #Assumes Column 1 has Target Variable (Factor)
C33 <- hh %>% { 
  ggpairs(data = ., mapping = aes(colour = Target, fill = Target, alpha = 0.3), 
          columns = 2:ncol(.), 
          lower = list(continuous = f_gg_scatter),
          diag = list(continuous = f_gg_density)) +
    labs(caption = cap_hh, subtitle = sub_hh, title = ttl_hh) 
}
assign(cap_hh, C33)
rm(C33)

Psych

# #IN: hh, cap_hh, ttl_hh, loc_png
if(!file.exists(loc_png)) {
  png(filename = loc_png, width = k_width, height = k_height, units = "in", res = 144) 
  pairs.panels(hh[2:ncol(hh)], smooth = FALSE, jiggle = TRUE, rug = FALSE, ellipses = FALSE, 
               bg = rev(viridis(2))[hh$Target], pch = 21, lwd = 1, cex.cor = 1, cex = 1, 
               gap = 0, main = ttl_hh)
  #title(main = ttl_hh, line = 2, adj = 0)
  title(sub = cap_hh, line = 4, adj = 1)
  C33 <- recordPlot()
  dev.off()
  assign(cap_hh, C33)
  rm(C33)
}

chart.Correlation

# #For Reference Only. Only add Histogram at the Diagonal to the Base pairs()
PerformanceAnalytics::chart.Correlation(hh[2:ncol(hh)], histogram = TRUE)

corrplot() vs. corPlot()

corrplot::corrplot vs. psych::corPlot()

Figure 36.20 corrplot::corrplot vs. psych::corPlot()

36.10 Insights

  • The four ‘charge’ fields are linear functions of the ‘minute’ fields, and should be omitted.
  • The ‘area code’ field and/or the ‘state’ field are anomalous, and should be omitted.
  • Customers with the International Plan tend to churn more frequently.
  • Customers with the Voice Mail Plan tend to churn less frequently.
  • Customers with four or more Customer Service Calls churn more than four times as often as the other customers.
  • Customers with both high Day Minutes and high Evening Minutes tend to churn at a higher rate (6 times) than the other customers.
  • Customers with low Day Minutes and high Customer Service Calls churn at a higher rate than the other customers.
  • Customers with lower numbers of International Calls churn at a higher rate than customers with more international calls.

Validation


37 Dimension Reduction

37.1 Dimensions

10 or 11 Dimensions are enough for the Universe. How many are needed for your data!

Multicollinearity

Definition 37.1 Multicollinearity is a condition where some of the predictor variables are strongly correlated with each other.
  • Problems: Multicollinearity
    • Multicollinearity leads to instability in the solution space and possibly incoherent results. In multiple regression, for example, a multicollinear set of predictors can result in a regression that is significant overall even when none of the individual variables is significant.
    • Even if such instability is avoided, inclusion of variables which are highly correlated tends to overemphasize a particular component of the model, as the component is essentially being double counted.
  • Problems: Too many variables
    • The sample size needed to fit a multivariate function grows exponentially with the number of variables.
    • The use of too many predictor variables to model a relationship with a response variable can unnecessarily complicate the interpretation of the analysis, and violates the principle of parsimony
      • i.e. keep the number of predictors to such a size that would be easily interpreted.
    • Also, retaining too many variables may lead to overfitting
      • i.e. generality of the findings is hindered because new data do not behave the same as the training data for all the variables.
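The "significant overall, yet no significant individual variable" symptom mentioned above is easy to reproduce with simulated collinear predictors (all names here are made up):

```r
# #Simulate two nearly identical predictors
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # x2 is almost a copy of x1
y  <- x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
summary(fit)   # overall F-test significant; individual t-tests usually are not
cor(x1, x2)    # near-perfect correlation between the predictors
```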

Parsimony

Definition 37.2 Principle of parsimony is the problem-solving principle that “entities should not be multiplied beyond necessity.”
  • It is inaccurately paraphrased as “the simplest explanation is usually the best one.”
  • It advocates that, when presented with competing hypotheses about the same prediction, one should select the solution with the fewest assumptions; it is not a way of choosing between hypotheses that make different predictions.

Overfitting & Underfitting

Definition 37.3 Overfitting is the production of an analysis that corresponds too closely to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
Definition 37.4 Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data.
  • An overfitted model is a statistical model that contains more parameters than can be justified by the data.
    • The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented underlying model structure.
    • Over-fitting occurs when a model begins to “memorize” training data rather than “learning” to generalize from a trend.
  • Under-fitting would occur, for example, when fitting a linear model to non-linear data. Such a model will tend to have poor predictive performance.
  • Overfitting Example
    • If the number of parameters is the same as or greater than the number of observations, then a model can perfectly predict the training data simply by memorizing the data in its entirety. Such a model, though, will typically fail severely when making predictions.
    • A noisy linear dataset can be fitted by a polynomial function also which would give a perfect fit. However, the linear function (in this case) would be better in extrapolating beyond the fitted data.
  • To decrease the chance or amount of overfitting
    • model comparison, cross-validation, regularization, early stopping, pruning, Bayesian priors, or dropout
    • These techniques work either by explicitly penalizing overly complex models,
    • or by evaluating model performance on a set of data not used for training, which is assumed to approximate the typical unseen data that a model will encounter.
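The noisy-linear-data example can be sketched directly: a degree-10 polynomial fits the training points better, but the plain linear model extrapolates more sensibly. Everything below is simulated:

```r
# #Truly linear data plus noise
set.seed(11)
x_train <- seq(0, 1, length.out = 20)
y_train <- 2 * x_train + rnorm(20, sd = 0.2)

fit_lin  <- lm(y_train ~ x_train)
fit_poly <- lm(y_train ~ poly(x_train, 10))   # overly flexible model

# #Training fit: the polynomial always achieves the smaller residual error
sum(resid(fit_lin)^2)
sum(resid(fit_poly)^2)

# #Extrapolation beyond the fitted range: the linear model stays near the
# #true value (3 at x = 1.5); the polynomial is typically far off
x_new <- data.frame(x_train = 1.5)
predict(fit_lin,  x_new)
predict(fit_poly, x_new)
```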

37.2 Dimension-reduction methods

  • These use the correlation structure among the predictor variables to accomplish the following:
    • To reduce the number of predictor items.
    • To help ensure that these predictor items are independent.
    • To provide a framework for interpretability of the results.
  • Dimension-reduction methods:
    • PCA
    • Factor analysis
    • User-defined composites

37.3 PCA

Definition 37.5 Principal components analysis (PCA) seeks to explain the correlation structure of a set of predictor variables \({m}\), using a smaller set of linear combinations of these variables, called components \({k}\). PCA acts solely on the predictor variables, and ignores the target variable.
  • Suppose that the original variables \({\{X_1, X_2, \ldots, X_m\}}\) form a coordinate system in m-dimensional space.
    • Let each variable \({X_i}\) represent an \({n \times 1}\) vector, where \({n}\) is number of records.
    • The principal components represent a new coordinate system, found by rotating the original system along the directions of maximum variability.
  • Analysis
    • Standardize the data, so that the mean for each variable is zero, and the standard deviation is one.
    • \({X_i \to Z_i = \frac{X_i - {\mu}_i}{{\sigma}_{ii}}}\)
    • The covariance is a measure of the degree to which two variables vary together.
    • A positive covariance indicates that, when one variable increases, the other tends to increase, while a negative covariance indicates that, when one variable increases, the other tends to decrease.
    • \({{\sigma}_{ii}^2}\) denotes the variance of \({X_i}\).
      • If \({X_i}\) and \({X_j}\) are independent, then \({{\sigma}_{ij}^2 = 0}\); but the reverse is not necessarily true, i.e. \({{\sigma}_{ij}^2 = 0}\) does not imply that \({X_i}\) and \({X_j}\) are independent.
      • Note that the covariance measure is not scaled, so that changing the units of measure would change the value of the covariance.
      • The correlation coefficient \({r_{ij}} = \frac{{\sigma}_{ij}^2}{{\sigma}_{ii}{\sigma}_{jj}}\) avoids this difficulty by scaling the covariance by each of the standard deviations.
      • Then, the correlation matrix is denoted as \({\rho}\)
Definition 37.6 Let \(\mathbf{B}\) be an \(m \times m\) matrix, and let \(\mathbf{I}\) be the \(m \times m\) identity matrix. Then the scalars \(\{\lambda_1, \lambda_2, \ldots, \lambda_m\}\) are said to be the eigenvalues of \(\mathbf{B}\) if they satisfy \(|\mathbf{B} - \lambda \mathbf{I}| = 0\), where \(|\mathbf{Q}|\) denotes the determinant of \(\mathbf{Q}\).
Definition 37.7 Let \(\mathbf{B}\) be an \(m \times m\) matrix, and let \({\lambda}\) be an eigenvalue of \(\mathbf{B}\). Then the nonzero \(m \times 1\) vector \(\overrightarrow{e}\) is said to be an eigenvector of \(\mathbf{B}\) if \(\mathbf{B} \overrightarrow{e} = \lambda \overrightarrow{e}\).
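Definitions 37.6 and 37.7 can be verified numerically with base R's eigen(); the 2 x 2 correlation matrix below is made up:

```r
# #A small symmetric (correlation) matrix
B  <- matrix(c(1.0, 0.8,
               0.8, 1.0), nrow = 2)
eg <- eigen(B)
eg$values                          # eigenvalues 1.8 and 0.2

# #Definition 37.7: B e = lambda e for each eigenpair
e1 <- eg$vectors[, 1]
all.equal(as.vector(B %*% e1), eg$values[1] * e1)

# #Definition 37.6: |B - lambda I| = 0 (up to floating point)
det(B - eg$values[1] * diag(2))
```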
  • The total variability in the standardized set of predictors equals the sum of the variances of the Z-vectors, which equals the sum of the variances of the components, which equals the sum of the eigenvalues, which equals the number of predictors
    • i.e. \(\sum_{i=1}^m {\text{Var}({Y_i})} = \sum_{i=1}^m {\text{Var}({Z_i})} = \sum_{i=1}^m {\lambda_i} = m\)
  • The partial correlation between a given component and a given predictor variable is a function of an eigenvector and an eigenvalue. Specifically, \(\text{Corr}(Y_i, Z_j) = e_{ij}\sqrt{\lambda_i}\), where \(\{ (\lambda_1, e_1), (\lambda_2, e_2), \ldots, (\lambda_m, e_m)\}\) are the eigenvalue–eigenvector pairs for the correlation matrix \(\rho\), and we note that \(\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_m\). In other words, the eigenvalues are ordered by size. (A partial correlation coefficient is a correlation coefficient that takes into account the effect of all the other variables.)
  • The proportion of the total variability in Z that is explained by the \(i^{\text{th}}\) principal component is the ratio of the \(i^{\text{th}}\) eigenvalue to the number of variables, that is, the ratio \(\frac{\lambda_i}{m}\).
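These identities are easy to check with prcomp() on standardized data; zw below is simulated rather than the housing predictors:

```r
# #Simulated predictors with some built-in correlation
set.seed(5)
zw <- data.frame(a = rnorm(200), b = rnorm(200))
zw$c <- zw$a + 0.5 * zw$b + rnorm(200, sd = 0.3)

pc     <- prcomp(zw, center = TRUE, scale. = TRUE)   # PCA on Z, not X
lambda <- pc$sdev^2                                  # eigenvalues of rho

sum(lambda)          # equals m, the number of predictors (here 3)
lambda / ncol(zw)    # proportion of total variability per component
```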

37.4 Data Housing

Please import the "C34-cadata.txt".

  • Source: http://lib.stat.cmu.edu/datasets/houses.zip
  • About: [20640, 9]
    • It provides census information from all the block groups from the 1990 California census.
    • For this data set, a block group has an average of 1425.5 people living in an area that is geographically compact.
    • Block groups were excluded that contained zero entries for any of the variables.
    • Variables: median house value (Target), median income, housing median age, total rooms, total bedrooms, population, households, latitude, and longitude.

37.5 Partition Train & Test Data

set.seed(3)  #wwww
# #sample() and its variants can be used for Partitioning of Dataset
if(FALSE) {
  # #This approach is difficult to extend for 3 or more splits
  # #Further, Using floor() multiple times might result in loss of a row 
  train_idx <- sample.int(n = nrow(bb), size = floor(0.8 * nrow(bb)), replace = FALSE)
  train_idx <- sample(seq_len(nrow(bb)), size = floor(0.8 * nrow(bb)), replace = FALSE)  
  train_bb <- bb[train_idx, ]
  test_bb <- bb[-train_idx, ]
}
#
# #For 3 or more splits
#brk_bb = c(train = 0.8, test = 0.1, validate = 0.1)
brk_bb = c(train = 0.9, test = 0.1)
idx_bb = sample(cut(seq_len(nrow(bb)), nrow(bb) * cumsum(c(0, brk_bb)), labels = names(brk_bb)))
#
# #Splits by Fixed Numbers not Percentages
if(FALSE) {
  brk_bb = c(train = 18570, test = nrow(bb))
  idx_bb = sample(cut(seq_len(nrow(bb)), c(0, brk_bb), labels = names(brk_bb)))
}
# #List of Multiple Tibbles
part_l = split(bb, idx_bb)
#
# #nrow(), ncol(), dim() can be applied
vapply(part_l, nrow, FUN.VALUE = integer(1))
## train  test 
## 18576  2064
stopifnot(identical(nrow(bb), sum(vapply(part_l, nrow, FUN.VALUE = integer(1)))))

37.6 Basics

Table 37.1: (C34T01)[18576, 9] Houses: Training Basics
Keys Min Max SD Mean Median Mode Unique isNA
House Value (Median) 14999 500001 115304.56 206928.9 179700 500001 3767 0
Income (Median) 0.5 15 1.9 3.87 3.53 15 11952 0
House Age (Median) 1 52 12.57 28.6 29 52 52 0
Rooms (Total) 2 39320 2191.95 2641.99 2127 1613 5733 0
Bedrooms (Total) 1 6445 425.22 539.53 435 280 1886 0
Population 3 35682 1140.33 1427.82 1166 850 3781 0
Households 1 6082 385.55 501.02 410 306 1773 0
Latitude 32.5 42 2.13 35.62 34.25 34.1 851 0
Longitude -124.3 -114 2 -119.56 -118.49 -118.3 832 0

37.7 Boxplot

Houses: Boxplot (Scaled)

Figure 37.1 Houses: Boxplot (Scaled)

37.8 Normality

Note that normality of the data is not strictly required to perform non-inferential PCA, but strong departures from normality may diminish the observed correlations. As data mining applications usually do not involve inference, we will not worry about normality.

37.9 Predictors SPLOM

  • Rooms, bedrooms, population, and households all appear to be positively correlated.
  • Latitude and longitude appear to be negatively correlated.
    • Scatter plot between them looks like the State of California
  • House Median Age appears to be correlated the least with the other predictors
Houses: SPLOM (8)

Figure 37.2 Houses: SPLOM (8)

37.10 Predictors Corplot

Houses: Corrplot (8)

Figure 37.3 Houses: Corrplot (8)

37.11 Correlation

Correlation Matrix

Table 37.2: (C34T02) Houses: Correlation Matrix
income h_age rooms bedrooms population households latitude longitude
income
h_age -0.117
rooms 0.193 -0.36
bedrooms -0.013 -0.32 0.93
population 0.001 -0.29 0.86 0.88
households 0.008 -0.3 0.92 0.98 0.9
latitude -0.078 0.01 -0.04 -0.07 -0.1 -0.07
longitude -0.017 -0.11 0.05 0.07 0.1 0.06 -0.9

Matrices

  • cor() : Correlation Function produces a Matrix
    • Matrices have a HUGE number of problems, but unfortunately some functions output in that form
    • names() does NOT work on Matrices but colnames() does, even though names() is superior to colnames() in all other respects
    • Symmetrical Matrix: the diagonal and one of the triangles (upper or lower) are redundant
    • Too many decimal places get printed
# #cor() produces a Matrix
ii <- cor(zw)
str(ii)
##  num [1:8, 1:8] 1 -0.11702 0.192888 -0.012576 0.000955 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:8] "income" "h_age" "rooms" "bedrooms" ...
##   ..$ : chr [1:8] "income" "h_age" "rooms" "bedrooms" ...
ii
##                  income       h_age       rooms    bedrooms   population   households    latitude
## income      1.000000000 -0.11701992  0.19288773 -0.01257616  0.000954879  0.008423239 -0.07844076
## h_age      -0.117019916  1.00000000 -0.35816935 -0.31683614 -0.292810501 -0.299390249  0.01077015
## rooms       0.192887732 -0.35816935  1.00000000  0.93037756  0.857256554  0.919196762 -0.03702508
## bedrooms   -0.012576164 -0.31683614  0.93037756  1.00000000  0.877848495  0.980039774 -0.06683883
## population  0.000954879 -0.29281050  0.85725655  0.87784849  1.000000000  0.907065416 -0.10872704
## households  0.008423239 -0.29939025  0.91919676  0.98003977  0.907065416  1.000000000 -0.07121960
## latitude   -0.078440760  0.01077015 -0.03702508 -0.06683883 -0.108727036 -0.071219604  1.00000000
## longitude  -0.016593996 -0.10712695  0.04526591  0.06873347  0.100075888  0.055907965 -0.92508077
##              longitude
## income     -0.01659400
## h_age      -0.10712695
## rooms       0.04526591
## bedrooms    0.06873347
## population  0.10007589
## households  0.05590797
## latitude   -0.92508077
## longitude   1.00000000
#
# #We can eliminate Lower Triangle and Diagonal. However NA does not print well with format()
# #outcome of upper.tri() is easily compared to as.table(). lower.tri() will need extra step
#
# #Take advantage of the Matrix Triangle and set it to 0 for later handling by format()
# #IF we remove the diagonal, the dimensions go haywire: only [28] elements remain, which do not fill 8 columns
# #IF we keep the diagonal, the triangle still has only [36] elements, which do not fill 8 columns either
# #So, cannot use NA, has to use ZERO (So that, later, format can replace it.)
#
# #However, we finally went ahead with dplyr solution which handled NA separately from format()
# #Thus eliminating need of assigning 0, NA are being used for redundant triangle and diagonal
#
kk <- ii #ii is FULL 8x8 Matrix
kk[upper.tri(kk, diag = TRUE)] <- NA 
mm  <- kk %>% as.table() %>% as_tibble(.name_repair = 'unique') %>% drop_na() %>% 
  filter(n > 0.5 | n < -0.5) %>% arrange(desc(abs(n)))
#
kk <- ii #ii is FULL 8x8 Matrix
nn  <- kk %>% as.table() %>% as_tibble(.name_repair = 'unique') %>% 
    filter(...1 != ...2) %>% 
    filter(!duplicated(paste0(pmax(as.character(...1), as.character(...2)), 
                              pmin(as.character(...1), as.character(...2))))) %>%
  filter(n > 0.5 | n < -0.5) %>% arrange(desc(abs(n)))
stopifnot(identical(mm, nn))
#
# #However, the above results are in long format because of as.table(). For wide format:
ll <- kk %>% #as.table() %>% 
  as_tibble() %>% 
  mutate(ID = row_number()) 
# #Get Names
oo <- names(ll)
ll %>% mutate(across(-ID, ~ ifelse(ID <= match(cur_column(), oo), NA, .x))) %>% select(-ID)
## # A tibble: 8 x 8
##      income   h_age   rooms bedrooms population households latitude longitude
##       <dbl>   <dbl>   <dbl>    <dbl>      <dbl>      <dbl>    <dbl> <lgl>    
## 1 NA        NA      NA       NA          NA        NA        NA     NA       
## 2 -0.117    NA      NA       NA          NA        NA        NA     NA       
## 3  0.193    -0.358  NA       NA          NA        NA        NA     NA       
## 4 -0.0126   -0.317   0.930   NA          NA        NA        NA     NA       
## 5  0.000955 -0.293   0.857    0.878      NA        NA        NA     NA       
## 6  0.00842  -0.299   0.919    0.980       0.907    NA        NA     NA       
## 7 -0.0784    0.0108 -0.0370  -0.0668     -0.109    -0.0712   NA     NA       
## 8 -0.0166   -0.107   0.0453   0.0687      0.100     0.0559   -0.925 NA

f_pKblM()

f_pKblM <- function(x, caption, isTri = TRUE, negPos = c(-0.0000001, 0.0000001), dig = 1L, ...) {
# #Description: 
# Prints Kable Matrix Standard Format: f_pKblM(hh, cap_hh)
# Calls: f_pKbl()
# #Arguments: 
# x: Matrix
# caption: Table Title with Table Number in "(AXXTYY)" Format
# isTri: When TRUE (Default) prints Lower Triangle only, otherwise the complete Matrix
# negPos: Vector of 2 values, to apply 3 colours to labels
# dig: Number of decimal places
# ... : Everything else is passed to f_pKbl()
#
  stopifnot(identical(length(negPos), 2L))
#
# #outcome of upper.tri() is easily compared to as.table(). lower.tri() will need extra step
  if(isTri) x[upper.tri(x, diag = TRUE)] <- NA
#
# #Suppress Warnings because 1 column is completely NA on which mutate(across()) is applied
# #Keeping the column is better to be seen as Matrix in this specific case of Correlation Matrix
# #Warning messages: no non-missing arguments to min; returning Inf
# #Warning messages: no non-missing arguments to max; returning -Inf
#
  x <- suppressWarnings(x %>%
# #Using as.table() gives long, otherwise wide
    #as.table() %>%
    as_tibble(rownames = NA, .name_repair = 'unique') %>%
# #Value based conditional formatting needs to happen before kbl() is called because 
# #mutate() does not work on kbl
# #format() needs to be called inside cell_spec() itself 
# #format cannot be applied later because once the value is modified for kbl() it becomes numeric
# #format cannot be applied before because it changes the value to character 
    mutate(across(everything(),
                  ~ cell_spec(ifelse(is.na(.x), "",
                      format(.x, digits = dig, scientific = FALSE, drop0trailing = TRUE)),
# #Change na_font_size to 1 or higher number to see bigger visual blobs on NA
                      font_size = spec_font_size(abs(.x), na_font_size = 0),
                      color = ifelse(.x < 0 | is.na(.x), "black", "black"),
                      background = case_when(is.na(.x) ~ "black",
                                       .x < negPos[1] ~ "#D8B365",
                                       .x >= negPos[2] ~ "#5AB4AC",
                                       TRUE ~ "grey")))))
  result <- f_pKbl(x, caption = caption, ...)
  return(result)
# #xxCLOSE: f_pKblM()
}

f_pKbl()

f_pKbl <- function(x, caption, headers = names(x), debug = FALSE, maxrows = 30L) {
# #Print Kable Standard Formats: f_pKbl(hh, cap_hh, headers = names_hh, debug = TRUE)
# #Kable Prints FULL DATASET passed to it.
# #names() does NOT work on Matrices but colnames() works
# #even though names() is superior to colnames() in all other aspects
# #We can do a conditional check on type and then call the relevant function but for now
# #Supply colnames() manually if using Matrices
#
  if(nrow(x) > maxrows) {
  #Print only the Head of Big Datasets by checking if it has more rows than maxrows
    x <- head(x)
  }
  txt_colour  <- ifelse(debug, "black", "white")
  result <- kbl(x,
    caption = caption,
    col.names = headers,
    escape = FALSE, align = "c", booktabs = TRUE
    ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                  html_font = "Consolas",   font_size = 12,
                  full_width = FALSE,
                  #position = "float_left",
                  fixed_thead = TRUE
    ) %>%
# #Header Row Dark & Bold: RGB (48, 48, 48) =HEX (#303030)
      row_spec(0, color = "white", background = "#303030", bold = TRUE,
               extra_css = "border-bottom: 1px solid; border-top: 1px solid"
      ) %>% row_spec(row = 1:nrow(x), color = txt_colour)
  return(result)
# #xxCLOSE: f_pKbl()
}

37.12 PCA

  • There are two general methods to perform PCA in R :
    • Spectral decomposition which examines the covariances / correlations between variables
      • princomp()
        • It uses divisor \(N\) for the covariance matrix.
      • Names: “sdev, loadings, center, scale, n.obs, scores, call”
    • Singular value decomposition (SVD) which examines the covariances / correlations between individuals
      • SVD has slightly better numerical accuracy
      • prcomp()
        • Unlike princomp, variances are computed with the usual divisor \(N - 1\).
        • cov() also uses \(N - 1\)
        • Use sdev or sqrt(Eigenvalues) to convert Eigenvectors into Loadings (For BOOK /psych Comparison)
      • Names: “sdev, rotation, center, scale, x”
  • Output contains
    • SD of Principal Components
    • rotation / loadings: the matrix of variable loadings (columns are eigenvectors)
    • x / scores: The coordinates of the individuals (observations) on the principal components.
  • Understanding the result
    • PCA was carried out on the eight predictors in the house data set. The component matrix is shown in Table 37.3.
    • Each of the columns in the Table represents one of the components \(Y_i = e_i^T\mathbf{Z}\).
    • The cell entries are called the component weights, and represent the partial correlation between the variable and the component.
      • As the component weights are correlations, they range between one and negative one.
      • NOTE: Sign may differ from the Book.
      • Eigenvalues are given by \({s}^2\) as shown in Table 37.4
      • First Eigenvalue is 3.9 and there are 8 predictor variables, thus, first component (PC1) explains \(3.9/8 \approx 48\%\) of the variance
        • i.e. this single component by itself carries about half of the information in all eight predictors.
        • In general, the first principal component may be viewed as the single best summary of the correlations among the predictors. Specifically, this particular linear combination of the variables accounts for more variability than any other linear combination.
        • The second principal component is the second-best linear combination of the variables, on the condition that it is orthogonal to the first principal component. It is derived from the variability that is left over, once the first component has been accounted for.
Definition 37.8 Two vectors are orthogonal if they are mathematically independent, have no correlation, and are at right angles to each other.
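The divisor difference between the two functions can be verified numerically; a minimal sketch on simulated data (object names are illustrative):

```r
# princomp() divides by N, prcomp() divides by N - 1 (matching cov())
set.seed(1)
dat <- matrix(rnorm(100 * 3), ncol = 3)  # 100 observations, 3 variables
n   <- nrow(dat)

v_prcomp   <- prcomp(dat)$sdev^2            # divisor N - 1
v_princomp <- unname(princomp(dat)$sdev^2)  # divisor N

# Rescaling princomp variances by N / (N - 1) recovers prcomp variances
all.equal(v_prcomp, v_princomp * n / (n - 1))
## [1] TRUE
```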

Component Matrix

# #Perform PCA by prcomp() #wwww
#ii <- princomp(zw)
pca_zw <- prcomp(zw)
#
names(pca_zw)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
#
# #Principal components have "loadings" i.e. $rotation and "scores" i.e. $x
# #Loadings specify the weight that each variable contributes to the principal component.
# #Scores show the value each sample has on each principal component.
#
dim(pca_zw$rotation)
## [1] 8 8
dim(pca_zw$x)
## [1] 18576     8
#
# #Matrix Multiplication i.e. %*% of original variables with loadings gives scores
bb <- as.matrix(zw) %*% pca_zw$rotation
all.equal(bb, pca_zw$x)
## [1] TRUE
identical(round(bb, 5), round(pca_zw$x, 5))
## [1] TRUE
#
summary(pca_zw)$importance
##                             PC1      PC2      PC3       PC4       PC5       PC6       PC7       PC8
## Standard deviation     1.975794 1.381294 1.035965 0.9078472 0.3852156 0.2846245 0.2162878 0.1211411
## Proportion of Variance 0.487970 0.238500 0.134150 0.1030200 0.0185500 0.0101300 0.0058500 0.0018300
## Cumulative Proportion  0.487970 0.726470 0.860620 0.9636400 0.9821900 0.9923200 0.9981700 1.0000000
#
pca_eigen <- summary(pca_zw)$importance %>% t() %>% as_tibble(rownames = "PCA") %>% 
  rename(SD = 2, pVar = 3, pVarCum = 4) %>% 
  mutate(EigenVal = SD^2, pVarManual = EigenVal/sum(EigenVal), 
         isOne = ifelse(EigenVal > 1, "Yes", "No"),
         isNinty = ifelse(pVarCum < 0.9, "Yes", "No"))
Table 37.3: (C34T03) Houses: PCA Component Matrix
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Income (Median) -0.04 0.03 -0.89 0.41 0.06 -0.06 -0.17 -0.041
House Age (Median) 0.22 -0.02 0.4 0.88 -0.03 0.09 -0.04 -0.004
Rooms (Total) -0.48 -0.07 -0.09 0.11 -0.32 0.56 0.55 0.152
Bedrooms (Total) -0.49 -0.06 0.12 0.06 -0.38 -0.23 -0.22 -0.703
Population -0.47 -0.03 0.11 0.08 0.85 0.13 -0.02 -0.133
Households -0.49 -0.06 0.11 0.1 -0.14 -0.4 -0.31 0.678
Latitude 0.07 -0.7 -0.01 -0.1 -0.05 0.47 -0.52 0.035
Longitude -0.08 0.7 0.05 -0.07 -0.1 0.48 -0.5 0.048


Loadings and Eigenvectors

#wwww
# #Component Matrix of prcomp() does not match with either BOOK or psych::principal()
# #prcomp() rotation contains eigenvectors not loadings. 
# #Loadings = Eigenvectors * sqrt(Eigenvalues) = Eigenvectors * sdev
#
# #psych::principal()
psy_zw <- principal(zw, nfactors = ncol(zw), rotate = 'none', scores = TRUE)
names(psy_zw)
##  [1] "values"       "rotation"     "n.obs"        "communality"  "loadings"     "fit"         
##  [7] "fit.off"      "fn"           "Call"         "uniquenesses" "complexity"   "chi"         
## [13] "EPVAL"        "R2"           "objective"    "residual"     "rms"          "factors"     
## [19] "dof"          "null.dof"     "null.model"   "criteria"     "STATISTIC"    "PVAL"        
## [25] "weights"      "r.scores"     "Vaccounted"   "Structure"    "scores"
#
#psy_zw$loadings
#
# #To Match them Multiply by SD = sqrt(Eigenvalues)
sd_pca <- summary(pca_zw)$sdev
eigen_pca <- sd_pca ^ 2
#
# #Multiply PC1 column with sqrt(Eigenvalue) of PC1 i.e. SD and so on
load_pca <- t(t(pca_zw$rotation) * sd_pca)
#
round(load_pca, 3)
##               PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8
## income     -0.083  0.047 -0.921  0.374  0.021 -0.016 -0.036 -0.005
## h_age       0.428 -0.021  0.413  0.803 -0.013  0.026 -0.009  0.000
## rooms      -0.956 -0.103 -0.097  0.104 -0.122  0.159  0.119  0.018
## bedrooms   -0.970 -0.084  0.120  0.057 -0.145 -0.066 -0.048 -0.085
## population -0.933 -0.036  0.118  0.074  0.327  0.036 -0.004 -0.016
## households -0.972 -0.088  0.112  0.087 -0.054 -0.114 -0.066  0.082
## latitude    0.145 -0.970 -0.012 -0.089 -0.018  0.133 -0.113  0.004
## longitude  -0.150  0.969  0.057 -0.063 -0.037  0.137 -0.109  0.006
round(psy_zw$loadings, 3)
## 
## Loadings:
##            PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8   
## income                    0.921  0.374                            
## h_age      -0.428        -0.413  0.803                            
## rooms       0.956  0.103         0.104  0.122  0.159 -0.119       
## bedrooms    0.970        -0.120         0.145                     
## population  0.933        -0.118        -0.327                     
## households  0.972        -0.112               -0.114              
## latitude   -0.145  0.970                       0.133  0.113       
## longitude   0.150 -0.969                       0.137  0.109       
## 
##                  PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8
## SS loadings    3.904 1.909 1.072 0.824 0.148 0.081 0.047 0.015
## Proportion Var 0.488 0.239 0.134 0.103 0.019 0.010 0.006 0.002
## Cumulative Var 0.488 0.727 0.861 0.964 0.982 0.992 0.998 1.000

37.13 Orthogonality of PCA


Figure 37.4 Houses: PCA Corrplot - ALL are ZERO

Code

# #Correlation Matrix | Table (Long) | Tibble
hh <- cor(pca_zw$x) %>% as.table() %>% as_tibble(.name_repair = 'unique') %>% 
  #filter(...1 != ...2) %>% 
  filter(!duplicated(paste0(pmax(as.character(...1), as.character(...2)), 
                            pmin(as.character(...1), as.character(...2))))) #%>% 
  #mutate(across(where(is.character), factor, levels = c_zsyw[-1], labels = names(c_zsyw)[-1]))
#
ttl_hh <- "Houses: PCA Corrplot - ALL are ZERO"
cap_hh <- "C34P04"

37.14 ScreePlot


Figure 37.5 House: PCA Screeplot with Variance

Table 37.4: (C34T04) Houses: PCA Eigenvalues & Variance
PCA SD exp_Variance cum_Var EigenVal isEigenOne isVarNinty
PC1 1.976 0.48797 0.488 3.9038 Yes Yes
PC2 1.381 0.2385 0.726 1.908 Yes Yes
PC3 1.036 0.13415 0.861 1.0732 Yes Yes
PC4 0.908 0.10302 0.964 0.8242 No No
PC5 0.385 0.01855 0.982 0.1484 No No
PC6 0.285 0.01013 0.992 0.081 No No
PC7 0.216 0.00585 0.998 0.0468 No No
PC8 0.121 0.00183 1 0.0147 No No
Total NA NA NA 8 NA NA

Code

hh <- pca_eigen
#
ttl_hh <- "Houses: PCA Eigenvalue ScreePlot"
cap_hh <- "C34P05"
y_hh <- "Eigenvalue"
# #IN: hh 
C34 <- hh %>% {ggplot(., aes(x = PCA, y = EigenVal)) + 
    geom_point(aes(color = isOne), size = 3) +
    geom_line(aes(group = 1)) +
    geom_hline(aes(yintercept = 1), color = '#440154FF', linetype = "dashed") +
    annotate("segment", x = 3.5, xend = 3.1, y = 1.6, 
                    yend = 1.2, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
    annotate("segment", x = 4.5, xend = 4.1, y = 1.3, 
                    yend = 0.9, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
    annotate("segment", x = 5.5, xend = 5.1, y = 0.6, 
                    yend = 0.2, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
    geom_text(data = tibble(x = c(3.5, 4.5, 5.5), y = c(1.7, 1.4, 0.7), 
              labels = c("Eigenvalue Criterion", "Screeplot Criterion", "Elbow Point")), 
              aes(x=x, y=y, label=labels), check_overlap = TRUE) + 
    scale_fill_distiller(palette = "BrBG") +
    #coord_fixed() +
    theme(legend.position = 'none') +
      labs(y = y_hh, subtitle = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C34)
rm(C34)
ttl_hh <- "Houses: PCA Proportion of Variance Explained"
cap_hh <- "C34P06"
y_hh <- NULL
# #IN: hh 
C34 <- hh %>% {ggplot(., aes(x = PCA, y = pVarCum)) + 
    geom_point(aes(color = isNinty), size = 3) +
    geom_line(aes(group = 1)) +
    geom_hline(aes(yintercept = 0.9), color = '#440154FF', linetype = "dashed") +
    annotate("segment", x = 4, xend = 4, y = 0.83, 
                    yend = 0.93, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
    geom_text(data = tibble(x = 5.5, y = 0.8, labels = c("Proportion of Variance Covered Criterion")), 
              aes(x=x, y=y, label=labels), check_overlap = TRUE) + 
    scale_fill_distiller(palette = "BrBG") +
    scale_y_continuous(limits = c(0, 1), labels=percent) + 
    theme(legend.position = 'none') +
    labs(y = y_hh,
         subtitle = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C34)
rm(C34)

37.15 How many Components

  • In our example, a single component (PC1) accounts for approximately half of the variability, while all 8 together account for 100%. So, between 1 and 8, where is the cut-off?
    • Criteria
      • The Eigenvalue Criterion
      • The Proportion of Variance Explained Criterion
      • The Minimum Communality Criterion - Deferred by Book
      • The Scree Plot Criterion
  • The Eigenvalue Criterion
    • Refer Figure 37.5 and Table 37.4
    • Sum of the eigenvalues represents the number of variables entered into the PCA i.e. 8
    • An eigenvalue of 1 would then mean that the component would explain about “one variable worth” of the variability.
    • Therefore, the eigenvalue criterion states that only components with eigenvalues greater than 1 should be retained.
    • Note that, if there are fewer than 20 variables, the eigenvalue criterion tends to recommend extracting too few components, while, if there are more than 50 variables, this criterion may recommend extracting too many.
    • Thus, in our example, 3 components can be retained. PC4 has an eigenvalue of about 0.8, so it may or may not be retained.
  • The Proportion of Variance Explained Criterion
    • We can specify how much of the total variability we would like the principal components to account for, and then select components accordingly
    • Thus, in our example, 3 can be retained; PC4 is selected if more than 90% of the variance must be accounted for
  • The Scree Plot Criterion
    • Elbow Point: The maximum number of components that should be extracted is just before where the plot first begins to straighten out into a horizontal line.
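The first two criteria can be computed directly from the eigenvalues (the Scree Plot Criterion remains a visual judgement); a sketch using the eigenvalues of Table 37.4:

```r
# Eigenvalues (SD^2) of the eight components, from Table 37.4
ev <- c(3.9038, 1.908, 1.0732, 0.8242, 0.1484, 0.081, 0.0468, 0.0147)

# Eigenvalue Criterion: retain components with eigenvalue > 1
sum(ev > 1)
## [1] 3

# Proportion of Variance Explained Criterion: first k components reaching 90%
which(cumsum(ev) / sum(ev) >= 0.9)[1]
## [1] 4
```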

37.16 Factor Scores

  • Modified Table 37.3 as 37.5
  • To investigate the relationship between PC3 and PC4, and their constituent variables, we next consider factor scores. Factor scores are estimated values of the factors for each observation, and are based on factor analysis.
Table 37.5: (C34T05) Houses: PCA Eigenvectors upto PC4
PC1 PC2 PC3 PC4
Income -0.889 0.412
Age 0.216 0.399 0.885
Rooms -0.484
Beds -0.491
Pop -0.472
Houses -0.492
Lat -0.702
Long 0.701

Figure 37.6 Correlation Matrices of Factor Scores of PC3 and PC4

  • Consider the left side of Figure 37.6
    • The strong negative correlation between component 3 and median income is strikingly evident
    • But the relationship between component 3 and housing median age is rather amorphous. It would be difficult to estimate the correlation between component 3 and housing median age as being 0.413, with only the scatter plot to guide us.
  • Similarly for the right side of Figure 37.6
    • The relationship between component 4 and housing median age is crystal clear
    • while the relationship between component 4 and median income is not entirely clear, reflecting its lower positive correlation of 0.374.
    • We conclude, therefore, that the component weight of 0.413 for housing median age in component 3 is not of practical significance, and similarly for the component weight for median income in component 4.
  • This discussion leads us to the following criterion for assessing the component weights.
    • For a component weight to be considered of practical significance, it should exceed \(\pm 0.5\) in magnitude.
    • Note that the component weight represents the correlation between the component and the variable; thus, the squared component weight represents the amount of the total variability of the variable that is explained by the component.
    • Thus, this threshold value of \(\pm 0.5\) requires that at least 25% of the variance of the variable be explained by a particular component.
  • Thus, the Table 37.3 is modified further as 37.6
    • NOTE: Signs differ from the Book, but that is OK.
    • NOTE: Values should match once the eigenvectors are converted to loadings
      • Note that the partition of the variables among the four components is mutually exclusive, meaning that no variable is shared (after suppression) by any two components
      • and exhaustive, meaning that all eight variables are contained in the four components.
Table 37.6: (C34T06) Houses: PCA Loadings (not Eigenvectors) upto PC4
PC1 PC2 PC3 PC4
Income -0.921
Age 0.803
Rooms -0.956
Beds -0.97
Pop -0.933
Houses -0.972
Lat -0.97
Long 0.969
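The suppression used to build Table 37.6 amounts to blanking loadings below the threshold; a small self-contained sketch (the two-variable matrix is illustrative, with values taken from the Loadings chunk):

```r
# Blank out component weights below 0.5 in magnitude (no practical significance)
ld <- matrix(c(-0.921, 0.374,
                0.413, 0.803),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("income", "h_age"), c("PC3", "PC4")))
ld[abs(ld) < 0.5] <- NA
ld  # income loads only on PC3, h_age only on PC4
```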

37.17 Matrix

Create

# #Create a Matrix
# #Default: The Matrix is Filled Column by Column
matrix(1:9, nrow = 3)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
#
# #Change Behaviour to fill the matrix by Row
matrix(1:9, nrow = 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Multiply Vector

# #names() does not work on Matrix
mm <- matrix(1:9, ncol = 3, byrow = TRUE)
rownames(mm) <- tail(letters, 3)
colnames(mm) <- head(letters, 3)
mm
##   a b c
## x 1 2 3
## y 4 5 6
## z 7 8 9
#
vv <- 1:3
#
# #Matrix Multiply (Deprecated)
# #Multiply the Vector with Matrix Rows i.e. x * 1, y * 2, z * 3
ii <- diag(vv) %*% mm #loss of rownames because it is taken from Left Side Matrix
#
# #Multiply the Vector with Matrix Columns i.e. a * 1, b * 2, c * 3
jj <- mm %*% diag(vv) #loss of colnames because it is taken from Right Side Matrix
#
# #Check Attributes
attributes(ii)$dimnames
## [[1]]
## NULL
## 
## [[2]]
## [1] "a" "b" "c"
attributes(jj)$dimnames
## [[1]]
## [1] "x" "y" "z"
## 
## [[2]]
## NULL
#
# #Add Missing RowNames or ColNames 
rownames(ii) <- rownames(mm)
colnames(jj) <- colnames(mm)
#
# #Coercing by as.integer() will produce a vector not a matrix. Use either mode() or class()
#ii[] <- as.integer(ii) #Even using [] is NOT coercing to integer
class(ii) <- "integer"
mode(jj) <- "integer" 
#
# #Print
ii
##    a  b  c
## x  1  2  3
## y  8 10 12
## z 21 24 27
str(ii)
##  int [1:3, 1:3] 1 8 21 2 10 24 3 12 27
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3] "x" "y" "z"
##   ..$ : chr [1:3] "a" "b" "c"
jj
##   a  b  c
## x 1  4  9
## y 4 10 18
## z 7 16 27
str(jj)
##  int [1:3, 1:3] 1 4 7 4 10 16 9 18 27
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3] "x" "y" "z"
##   ..$ : chr [1:3] "a" "b" "c"
#
# #Equivalent: sweep() (Deprecated). SLOW; however, it keeps the dimnames.
swp_ii <- sweep(mm, MARGIN = 1, vv, `*`)
swp_jj <- sweep(mm, MARGIN = 2, vv, `*`)
#
# #Recommended:
# #Equivalent: R Recycle Vector Column-wise. So double-transpose is needed if multiplying on jj.
# #Double Transpose is FASTEST & retains dimnames. Bonus: This is Commutative.
rec_ii <- mm * vv
com_ii <- vv * mm 
rec_jj <- t(t(mm) * vv)
com_jj <- t(vv * t(mm))
all(identical(rec_ii, com_ii), identical(rec_jj, com_jj)) #Commutative
## [1] TRUE
#
all(identical(ii, swp_ii), identical(jj, swp_jj), identical(ii, rec_ii), identical(jj, rec_jj))
## [1] TRUE

Square Elements

mm <- matrix(1:9, ncol = 3, byrow = TRUE)
rownames(mm) <- tail(letters, 3)
colnames(mm) <- head(letters, 3)
mm
##   a b c
## x 1 2 3
## y 4 5 6
## z 7 8 9
#
# #Square Each Element of the Matrix
mm ** 2 # the ** operator is obscure and is actually parsed as ^, so prefer ^
##    a  b  c
## x  1  4  9
## y 16 25 36
## z 49 64 81
mm ^ 2
##    a  b  c
## x  1  4  9
## y 16 25 36
## z 49 64 81
stopifnot(identical(mm ^ 2, mm ** 2))

37.18 Communalities

Definition 37.9 PCA does not extract all the variance from the variables, but only that proportion of the variance that is shared by several variables. Communality represents the proportion of variance of a particular variable that is shared with other variables. Communality values are calculated as the sum of squared component weights, for a given variable.
  • The communalities represent the overall importance of each of the variables in the PCA as a whole.
  • Communalities that are very low for a particular variable should be an indication that the particular variable might not participate in the PCA solution (i.e., might not be a member of any of the principal components).
  • Communalities less than 0.5 can be considered to be too low, as this would mean that the variable shares less than half of its variability in common with the other variables.
    • Now, if we want to keep the variable housing median age as an active part of the analysis, then, extracting only three components would not be adequate, as housing median age shares only 35% of its variance with the other variables. To keep this variable in the analysis, we would need to extract the fourth component, which lifts the communality for housing median age over the 50% threshold.
  • Minimum Communality Criterion
    • Enough components should be extracted so that the communalities for each of these variables exceed a certain threshold (e.g., 50%)
Table 37.7: (C34T07) Houses: PCA Loadings with Communalities
PC1 PC2 PC3 PC4 Comm_PC3 Comm_PC4
Income -0.083 0.047 -0.921 0.374 0.86 1
Age 0.428 -0.021 0.413 0.803 0.35 1
Rooms -0.956 -0.103 -0.097 0.104 0.93 0.95
Beds -0.97 -0.084 0.12 0.057 0.96 0.97
Pop -0.933 -0.036 0.118 0.074 0.89 0.89
Houses -0.972 -0.088 0.112 0.087 0.97 0.97
Lat 0.145 -0.97 -0.012 -0.089 0.96 0.97
Long -0.15 0.969 0.057 -0.063 0.96 0.97
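The communality columns of Table 37.7 are row sums of squared loadings over the retained components; a sketch using the h_age loadings from the Loadings chunk:

```r
# h_age loadings on PC1..PC4 (from round(load_pca, 3) earlier)
l_hage <- c(0.428, -0.021, 0.413, 0.803)

round(sum(l_hage[1:3]^2), 2)  # 3 components: below the 0.5 threshold
## [1] 0.35
round(sum(l_hage^2), 2)       # 4 components: h_age is fully captured
## [1] 1
```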

37.19 Decision on How many Components

  • The Eigenvalue Criterion recommended 3 components, but did not absolutely reject the 4th component. Also, for small numbers of variables, this criterion can underestimate the best number of components to extract.

  • The Proportion of Variance Explained Criterion stated that we needed to use 4 components if we wanted to account for >90% of the variability.

  • The Scree Plot Criterion said not to exceed 4 components.

  • The Minimum Communality Criterion stated that, if we wanted to keep housing median age in the analysis, we had to extract 4 components.

  • Conclusion: PC4 is included.

  • Test Dataset

    • We can also perform PCA on the Test dataset, and it should show a similar pattern for PC1 to PC4 in terms of eigenvectors and loadings (not exactly the same, but similar). This can be taken as confirmation that the PCA of the Training data can be applied to the Test data
    • If the results differ too much, it should be taken as an indication that the training data is not representative of the test data.
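The projection itself does not require refitting: `predict.prcomp` applies the training centering, scaling, and rotation to new data. A minimal sketch on simulated data (names are illustrative):

```r
set.seed(42)
train <- as.data.frame(matrix(rnorm(200 * 4), ncol = 4))
test  <- as.data.frame(matrix(rnorm(50 * 4),  ncol = 4))

pca_tr <- prcomp(train, center = TRUE, scale. = TRUE)

# Project test observations onto the training components
scores_te <- predict(pca_tr, newdata = test)
dim(scores_te)
## [1] 50  4
```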

37.20 Factor Analysis

Factor Analysis (FA) is related to PCA but the two methods have different goals.

Principal components seek to identify orthogonal linear combinations of the variables, to be used either for descriptive purposes or to substitute a smaller number of uncorrelated components for the original variables.

In contrast, factor analysis represents a model for the data, and as such is more elaborate.

The factor analysis model hypothesizes that the response vector \({\{X_1, X_2, \ldots, X_m\}}\) can be modeled as linear combinations of a smaller set of \({k}\) unobserved, “latent” random variables \({\{F_1, F_2, \ldots, F_k\}}\) called common factors, along with an error term \(\mathbf{\epsilon} = {\{\epsilon_1, \epsilon_2, \ldots, \epsilon_m\}}\). Specifically, the factor analysis model is :

\[\underset{m \times 1}{\mathbf{X - \mu}} = \underset{m \times k}{\mathbf{L}} \, \underset{k \times 1}{\mathbf{F}} + \underset{m \times 1}{\mathbf{\epsilon}} \tag{37.1}\]

Where \(\underset{m \times 1}{\mathbf{X - \mu}}\) is the response vector, centered by the mean vector, \(\underset{m \times k}{\mathbf{L}}\) is the matrix of factor loadings, with \(l_{ij}\) representing the factor loading of the \(i^{\text{th}}\) variable on the \(j^{\text{th}}\) factor, \(\underset{k \times 1}{\mathbf{F}}\) represents the vector of unobservable common factors, and \(\underset{m \times 1}{\mathbf{\epsilon}}\) represents the error vector.

The factor analysis model differs from other models, such as the linear regression model, in that the predictor variables \({\{F_1, F_2, \ldots, F_k\}}\) are unobservable. Because so many terms are unobserved, further assumptions must be made before we may uncover the factors from the observed responses alone.

These assumptions are that \(E(\mathbf{F}) = \mathbf{0}, \text{Cov}(\mathbf{F}) = \mathbf{I}, E(\mathbf{\epsilon}) = \mathbf{0}, \text{Cov}(\mathbf{\epsilon})\) is a diagonal matrix.

Unfortunately, the factor solutions provided by factor analysis are invariant under orthogonal transformations (rotations). Hence, the factors uncovered by the model are in essence nonunique, without further constraints. This indistinctness provides the motivation for factor rotation.
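Base R's `factanal()` fits this model by maximum likelihood, with a varimax rotation by default; a minimal sketch on simulated data (the houses data is not used here, and the object names are illustrative):

```r
set.seed(7)
# Simulate six observed variables driven by two latent factors plus noise
Fm <- matrix(rnorm(300 * 2), ncol = 2)          # common factors F
Lm <- matrix(runif(6 * 2, 0.4, 0.9), ncol = 2)  # loadings L
X  <- Fm %*% t(Lm) + matrix(rnorm(300 * 6, sd = 0.3), ncol = 6)

fa <- factanal(X, factors = 2, rotation = "varimax")
dim(fa$loadings)  # estimated L, unique only up to rotation and sign
## [1] 6 2
fa$uniquenesses   # estimates of the diagonal of Cov(epsilon)
```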

37.21 UCI Data Repository

37.22 Data Adult

Please import the "C34-adult.csv".

  • Source: https://archive-beta.ics.uci.edu/ml/datasets/adult
  • About: Train [32561, 15] & Test[16281, 15] = Total[48842, 15]
    • The intended task is to find the set of demographic characteristics that can best predict whether or not the individual has an income of over 50000 dollars per year.
  • Steps (External) Luke Perich
    • Merged Train and Test for easy and simultaneous cleaning. Source Column attached for easy separation later.
    • ‘fnlwgt’ (stands for final weight) - Dropped
      • It has no predictive power since it is a feature aimed to allocate similar weights to people with similar demographic characteristics.
    • ‘income’
      • Edited “.” in Text and Modified to 0 (<= 50K) and 1 (>50K) as Factor for Summary
    • ‘Education’ is dropped
      • It is just a label on ‘education_num’ (number of years of education). (Not Tested “ForLater”)
    • ‘marital_status’
      • Number of Levels reduced by merging 3 types of Married. Two of them have small count.
    • ‘age’ - num - Min 17 to Max 90 All Numbers are Present
    • ‘education_num’ - num - Min 1 to Max 16 All Numbers are Present
    • ‘hours_per_week’ - num -
      • 3 Hours are missing but that can happen
      • cannot remove 99 hours because 98 hours and other nearby hours are also present in the data
    • ‘capital_gain’ & ‘capital_loss’ - Both removed
      • After 41310 dollars, There is directly 99999 dollars which cannot be correct. It should be set to Median if it is kept
      • 44807 observations have 0 capital gain
      • 46560 observations have 0 capital loss
    • ‘workclass,’ ‘occupation,’ ‘native_country’
      • These 3 contain Question Mark which have been converted to NA but not removed from dataset to have the possibility of imputation or to keep their other column variable information for analysis
    • ‘native_country’
      • With huge bias towards US there is no point in having so many countries or regions even.
      • Changed to Binary Factors of USA and Other

Import

Processing

# #Merge Tibbles with ID Names in Column
# #NA Introduced by changing Question Mark to NA
aa <- bind_rows(Train = tbl_aa, Test = tbl_bb, .id = 'source') 
#
bb <- aa %>% 
  select(-c(fnlwgt, education, capital_gain, capital_loss)) %>% 
  mutate(Income = ifelse(Income == "<=50K" | Income == "<=50K.", "0", "1")) %>% 
  mutate(across(c(workclass, occupation, native_country), ~na_if(., "?"))) %>% 
  mutate(native_country = ifelse(str_detect(native_country, 
            paste0(c("United-", "Outlying-US"), collapse = "|")), "USA", "Other")) %>% 
  mutate(across(where(is.character), ~ factor(., levels = unique(.)))) %>% 
  mutate(marital_status = 
    fct_collapse(marital_status, 
      Married = c("Married-civ-spouse", "Married-spouse-absent", "Married-AF-spouse"))) 
#
xxC34Adult <- bb
f_setRDS(xxC34Adult)

Merge Factor Levels

# #Levels of each Factor Variable
#lapply(aa[, sapply(aa, is.factor)], levels)
#levels(aa$marital_status)
summary(aa$marital_status)
ii <- aa %>% select(marital_status) %>% 
  mutate(marital_status = 
    fct_collapse(marital_status, 
      Married = c("Married-civ-spouse", "Married-spouse-absent", "Married-AF-spouse")))
#summary(ii$marital_status)

Check Numeric

# #Check Numeric Columns by summary()
if(TRUE) aa %>% select(where(is.numeric)) %>% summary()
if(TRUE) sort(unique(aa$age))
if(TRUE) length(sort(unique(aa$age)))
if(TRUE) aa %>% count(age) %>% mutate(PROP = n/sum(n)) #%>% arrange(desc(n)) %>% head(10)
#
# #Find Missing Numbers in a Sequence of Numbers
#ii <- unique(aa$age) 
ii <- unique(aa$hours_per_week)
jj <- min(ii):max(ii)
jj[!jj %in% ii]
#
# #Equivalent
setdiff(jj, ii)

Search String

# #To Search For Question Mark in All Factor Columns, Question Mark needs to be escaped
# #The Backslash used for escaping itself needs to be escaped using Backslash
aa %>% rowwise() %>%
  mutate(find_me = any(str_detect(c_across(where(is.factor)), 
                                  regex("\\?", ignore_case = TRUE)), na.rm = TRUE)) %>% 
  filter(find_me)
#
# #To Get the Column Names containing a String i.e. '?'
which(vapply(aa, function(x) any(stri_detect(x, regex = "\\?", max_count = 1)), logical(1)))

Replace Multiple Partial Matches

# #Search and Replace for Multiple Partial Matches 
# #NOTE: "|" should be used to collapse NOT " | "
# #NOTE: Question Marks Replaced as Other
aa %>% mutate(native_country = ifelse(str_detect(native_country, 
            paste0(c("United-", "Outlying-US"), collapse = "|")), "USA", "Other")) %>% 
  count(native_country)

Check Factor

# #Check Factor Columns by summary()
if(TRUE) aa %>% select(where(is.factor)) %>% summary()
ii <- factor(aa$native_country)
if(TRUE) levels(ii)
if(TRUE) nlevels(ii)
aa %>% count(native_country) #%>% tail(10) 

37.23 Correlation Matrix

Note that the correlations, although statistically significant in several cases, are overall much weaker than the correlations from the ‘houses’ data set. A weaker correlation structure should pose more of a challenge for the dimension-reduction method.

NOTE: While the Book created ‘Net Capital,’ I have skipped that because the Capital Gain and Capital Loss columns have an extremely high number of zeros. Further, an ID column ‘fnlwgt’ was also dropped.

Table 37.8: (C34T08) Adult: Correlation Matrix

        age     edu     hours
age
edu     0.0363
hours   0.0699  0.15
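A lower-triangle display like Table 37.8 can be produced with cor() and upper.tri(). A minimal sketch follows; since the document's ‘adl_zw’ object is not constructed here, simulated stand-ins for the three predictors are used:

```r
# #Lower-triangle correlation display (sketch with simulated stand-ins for adl_zw)
set.seed(1)
dd <- data.frame(age   = rnorm(1000),
                 edu   = rnorm(1000),
                 hours = rnorm(1000))
cm <- round(cor(dd), 4)
# #Blank out the diagonal and upper triangle, as in Table 37.8
cm[upper.tri(cm, diag = TRUE)] <- NA
print(cm, na.print = "")
```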

Factor analysis requires a certain level of correlation in order to function appropriately. The following tests have been developed to ascertain whether there exists sufficiently high correlation to perform factor analysis.

  • Note, however, that statistical tests in the context of huge data sets can be misleading. With huge sample sizes, even the smallest effect sizes become statistically significant. This is why data mining methods rely on cross-validation methodologies, not statistical inference.

  • The proportion of variability within the standardized predictor variables which is shared in common, and therefore might be caused by underlying factors, is measured by the Kaiser–Meyer–Olkin (KMO) Measure of Sampling Adequacy.

    • Values of the KMO statistic less than 0.50 indicate that factor analysis may not be appropriate.
  • Bartlett Test of Sphericity tests the null hypothesis that the correlation matrix is an identity matrix, that is, that the variables are really uncorrelated.

  • The statistic reported is the p-value, so that very small values would indicate evidence against the null hypothesis, that is, the variables really are correlated.

  • For p-values much larger than 0.10, there is insufficient evidence that the variables are correlated, and so factor analysis may not be suitable.

  • It compares an observed correlation matrix to the identity matrix.

    • Essentially it checks to see if there is a certain redundancy between the variables that we can summarize with a small number of factors.
    • The null hypothesis of the test is that the variables are orthogonal, i.e. not correlated.
    • If the numbers in the matrix are correlation coefficients, then an identity matrix means that each variable is perfectly orthogonal (i.e. “uncorrelated”) to every other variable, and thus a data-reduction technique like PCA or factor analysis would not be able to “compress” the data in any meaningful way.
  • The KMO statistic has a value of 0.52, which is not less than 0.5, meaning that this test does not find the level of correlation to be too low for factor analysis.

  • The p-value for Bartlett Test of Sphericity rounds to zero, so that the null hypothesis that no correlation exists among the variables is rejected. We therefore proceed with the factor analysis.

37.24 KMO Test

# #KMO Test: Measure of Sampling Adequacy (MSA) 
KMO(cor(adl_zw))
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(adl_zw))
## Overall MSA =  0.52
## MSA for each item = 
##   age   edu hours 
##  0.56  0.51  0.51

37.25 Bartlett Test of Sphericity

  • Caution: It is NOT the same as the ‘Bartlett Test for Equality of Variances’
  • This test requires multivariate normality. If this condition is not met, KMO can still be used.
# #Bartlett Test of Sphericity
bartsph <- cortest.bartlett(cor(adl_zw), n = nrow(adl_zw))
if(bartsph$p.value < 0.05) {
  cat("Null Rejected. Variables are Correlated. Dimension Reduction can be performed.\n")
} else {
  cat("Failed to reject the Null. Uncorrelated Variables. No benefit in Dimension Reduction.\n")
}
## Null Rejected. Variables are Correlated. Dimension Reduction can be performed.

37.26 FA

  • To allow us to view the results using a scatter plot, we decide a priori to extract only two factors.

  • The following factor analysis is performed using the principal axis factoring option.

    • In principal axis factoring, an iterative procedure is used to estimate the communalities and the factor solution.
    • This particular analysis required 152 such iterations before reaching convergence.
    • The eigenvalues and the proportions of the variance explained by each factor are shown in Table
  • fa()

    • Trying with ‘pa’ as given in the Book, even though ‘pa’ produces warnings whereas ‘minres’ does not
  • Warning: The estimated weights for the factor scores are probably incorrect. Try a different factor score estimation method.

    • It most probably arises because age and edu might be serially correlated, i.e.
      • as age increases, education might also increase with a lag; age values might be shifted forward in time relative to education values
  • Warning: An ultra-Heywood case was detected. Examine the results carefully

    • communality > 1
    • Heywood cases should be treated as invalid.
    • Try to reduce the number of factors, try other initial communalities (in PAF method), try to drop variables with low KMO, check multicollinearity
    • These are encountered typically when there are too few variables to support the requested number of factors.
      • Both Warnings go away when requested factors were reduced from 3 to 2
ERROR 37.1 Error in if (prod(R2) < 0) : missing value where TRUE/FALSE needed
  • Add ‘SMC = FALSE’ in fa()

  • NOTE

    • Cumulative variance is only 48%, i.e. less than half of the variance is explained by the first two factors extracted.
      • In contrast, the Housing data had ~76% explained by the first two factors because the correlation there was strong

pa

adl_fa <- fa(adl_zw, nfactors = 2, #ncol(adl_zw)
   fm = "pa", rotate = "none", SMC = FALSE)
# #Loadings
adl_fa$loadings
## 
## Loadings:
##       PA1    PA2   
## age    0.983       
## edu           0.633
## hours         0.225
## 
##                  PA1   PA2
## SS loadings    0.982 0.457
## Proportion Var 0.327 0.152
## Cumulative Var 0.327 0.480
#
# #Values
adl_fa$values
## [1] 0.9817285949 0.4567784008 0.0007492974
#
# #Communalities
adl_fa$communalities
## [1] 1.438507

pa

# #Warnings
# #The estimated weights for the factor scores are probably incorrect.  
# #Try a different factor score estimation method.
# #An ultra-Heywood case was detected.  Examine the results carefully
fa(adl_zw, nfactors = 2, #ncol(adl_zw)
   fm = "pa", rotate = "none", SMC = FALSE)
## Factor Analysis using method =  pa
## Call: fa(r = adl_zw, nfactors = 2, rotate = "none", SMC = FALSE, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        PA1   PA2    h2    u2 com
## age   0.98 -0.08 0.972 0.028 1.0
## edu   0.09  0.63 0.408 0.592 1.0
## hours 0.09  0.23 0.058 0.942 1.3
## 
##                        PA1  PA2
## SS loadings           0.98 0.46
## Proportion Var        0.33 0.15
## Cumulative Var        0.33 0.48
## Proportion Explained  0.68 0.32
## Cumulative Proportion 0.68 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  3  and the objective function was  0.03 with Chi Square of  706.35
## The degrees of freedom for the model are -2  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0 
## The df corrected root mean square of the residuals is  NA 
## 
## The harmonic number of observations is  25000 with the empirical chi square  0  with prob <  NA 
## The total number of observations was  25000  with Likelihood Chi Square =  0  with prob <  NA 
## 
## Tucker Lewis Index of factoring reliability =  1.004
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA1   PA2
## Correlation of (regression) scores with factors   0.98  0.66
## Multiple R square of scores with factors          0.97  0.43
## Minimum correlation of possible factor scores     0.94 -0.14

With cor()

# #Warnings
# #The estimated weights for the factor scores are probably incorrect.  
# #Try a different factor score estimation method.
# #An ultra-Heywood case was detected.  Examine the results carefully
fa(cor(adl_zw), nfactors = 2, #ncol(adl_zw)
   fm = "pa", rotate = "none", n.obs = 25000, SMC = FALSE)
## Factor Analysis using method =  pa
## Call: fa(r = cor(adl_zw), nfactors = 2, n.obs = 25000, rotate = "none", 
##     SMC = FALSE, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        PA1   PA2    h2    u2 com
## age   0.98 -0.08 0.972 0.028 1.0
## edu   0.09  0.63 0.408 0.592 1.0
## hours 0.09  0.23 0.058 0.942 1.3
## 
##                        PA1  PA2
## SS loadings           0.98 0.46
## Proportion Var        0.33 0.15
## Cumulative Var        0.33 0.48
## Proportion Explained  0.68 0.32
## Cumulative Proportion 0.68 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  3  and the objective function was  0.03 with Chi Square of  706.35
## The degrees of freedom for the model are -2  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0 
## The df corrected root mean square of the residuals is  NA 
## 
## The harmonic number of observations is  25000 with the empirical chi square  0  with prob <  NA 
## The total number of observations was  25000  with Likelihood Chi Square =  0  with prob <  NA 
## 
## Tucker Lewis Index of factoring reliability =  1.004
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA1   PA2
## Correlation of (regression) scores with factors   0.98  0.66
## Multiple R square of scores with factors          0.97  0.43
## Minimum correlation of possible factor scores     0.94 -0.14

minres

# #No Warning
fa(adl_zw, nfactors = ncol(adl_zw), fm = "minres", rotate = "none")
## Factor Analysis using method =  minres
## Call: fa(r = adl_zw, nfactors = ncol(adl_zw), rotate = "none", fm = "minres")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        MR1   MR2 MR3    h2   u2 com
## age   0.15  0.14   0 0.042 0.96 2.0
## edu   0.36 -0.12   0 0.144 0.86 1.2
## hours 0.43  0.05   0 0.187 0.81 1.0
## 
##                        MR1  MR2  MR3
## SS loadings           0.34 0.04 0.00
## Proportion Var        0.11 0.01 0.00
## Cumulative Var        0.11 0.12 0.12
## Proportion Explained  0.90 0.10 0.00
## Cumulative Proportion 0.90 1.00 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  3  and the objective function was  0.03 with Chi Square of  706.35
## The degrees of freedom for the model are -3  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0 
## The df corrected root mean square of the residuals is  NA 
## 
## The harmonic number of observations is  25000 with the empirical chi square  0  with prob <  NA 
## The total number of observations was  25000  with Likelihood Chi Square =  0  with prob <  NA 
## 
## Tucker Lewis Index of factoring reliability =  1.004
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                     MR1   MR2 MR3
## Correlation of (regression) scores with factors    0.54  0.20   0
## Multiple R square of scores with factors           0.29  0.04   0
## Minimum correlation of possible factor scores     -0.43 -0.92  -1

37.27 Factor Rotation

To assist in the interpretation of the factors, factor rotation may be performed. Factor rotation corresponds to a transformation (usually orthogonal) of the coordinate axes, leading to a different set of factor loadings.

The sharpest focus occurs when each variable has high factor loadings on a single factor, with low-to-moderate loadings on the other factors.

  • “ForLater” Figure 4.6 Page 115 - Biplot

  • No significant difference was observed with different Rotations, unlike BOOK.

# #A huge number of rotations is available, including "none", "varimax", "quartimax", "equamax" ...
# #No significant difference observed
fa(adl_zw, nfactors = 2, fm = "pa", rotate = "none", SMC = FALSE)$loadings
## 
## Loadings:
##       PA1    PA2   
## age    0.983       
## edu           0.633
## hours         0.225
## 
##                  PA1   PA2
## SS loadings    0.982 0.457
## Proportion Var 0.327 0.152
## Cumulative Var 0.327 0.480
#
fa(adl_zw, nfactors = 2, fm = "pa", rotate = "varimax", SMC = FALSE)$loadings
## 
## Loadings:
##       PA1    PA2   
## age    0.982       
## edu           0.638
## hours         0.236
## 
##                  PA1   PA2
## SS loadings    0.968 0.471
## Proportion Var 0.323 0.157
## Cumulative Var 0.323 0.480
#
fa(adl_zw, nfactors = 2, fm = "pa", rotate = "quartimax", SMC = FALSE)$loadings
## 
## Loadings:
##       PA1   PA2  
## age   0.986      
## edu         0.638
## hours       0.232
## 
##                  PA1   PA2
## SS loadings    0.978 0.461
## Proportion Var 0.326 0.154
## Cumulative Var 0.326 0.480
#
fa(adl_zw, nfactors = 2, fm = "pa", rotate = "equamax", SMC = FALSE)$loadings
## 
## Loadings:
##       PA1   PA2  
## age   0.986      
## edu         0.638
## hours       0.232
## 
##                  PA1   PA2
## SS loadings    0.977 0.461
## Proportion Var 0.326 0.154
## Cumulative Var 0.326 0.480

37.28 User Defined Composites

  • User Defined Composites or Summated Scales
    • A user-defined composite is simply a linear combination of the variables, which combines several variables together into a single composite measure.
    • The simplest user-defined composite is simply the mean of the variables.
    • When compared to the use of individual variables, user-defined composites provide a way to diminish the effect of measurement error.
      • Measurement error refers to the disparity between the observed variable values, and the “true” variable value. Measurement error contributes to the background error noise, interfering with the ability of models to accurately process the signal provided by the data, with the result that truly significant relationships may be missed.
      • User-defined composites reduce measurement error by combining multiple variables into a single measure.
    • Appropriately constructed user-defined composites allow the analyst to represent the manifold aspects of a particular concept using a single measure.
      • Thus, user-defined composites enable the analyst to embrace the range of model characteristics, while retaining the benefits of a parsimonious model.
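The simplest composite described above, the mean of the variables, can be sketched in base R. The variable names are illustrative, not from the Adult data set; standardizing first keeps variables measured on different scales from dominating the average:

```r
# #User-defined composite: row-wise mean of standardized variables (sketch)
set.seed(2)
dd <- data.frame(x1 = rnorm(100, mean = 50,  sd = 10),
                 x2 = rnorm(100, mean = 5,   sd = 2),
                 x3 = rnorm(100, mean = 0.5, sd = 0.1))
# #scale() standardizes each column; rowMeans() averages across variables
composite <- rowMeans(scale(dd))
summary(composite)
```

Averaging after standardization gives each variable equal weight in the composite.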

Validation


38 Model Data

38.1 Overview

  • “Preparing to Model the Data”

38.2 Data Mining

  • Data mining methods may be categorized as either supervised or unsupervised.
    • Most data mining methods are supervised methods.
    • Unsupervised : Clustering
    • Supervised : Regression, decision trees, neural networks, k-nearest neighbors
    • Another data mining method, which may be supervised or unsupervised, is association rule mining.
Definition 38.1 In unsupervised methods, no target variable is identified as such. Instead, the data mining algorithm searches for patterns and structures among all the variables. The most common unsupervised data mining method is clustering. Ex: Voter Profile.
Definition 38.2 Supervised methods are those in which there is a particular prespecified target variable and the algorithm is given many examples where the value of the target variable is provided. This allows the algorithm to learn which values of the target variable are associated with which values of the predictor variables.

38.3 Statistical Inference vs. Data Mining

  • Statistical methodology and data mining methodology differ in the following two ways:
    • Applying statistical inference using the huge sample sizes encountered in data mining tends to result in statistical significance, even when the results are not of practical significance.
    • In statistical methodology, the data analyst has an a priori hypothesis in mind. Data mining procedures usually do not have an a priori hypothesis.
Definition 38.3 An a priori hypothesis is one that is generated prior to a research study taking place.

38.4 Cross-validation

Definition 38.4 Cross-validation is a technique for ensuring that the results uncovered in an analysis are generalizable to an independent, unseen data set.
  • In data mining, the most common methods are twofold cross-validation and k-fold cross-validation.
    • In twofold cross-validation, the data are partitioned, using random assignment, into a training data set and a test data set. The test data set should then have the target variable omitted. Thus, the only systematic difference between the training data set and the test data set is that the training data includes the target variable and the test data does not.
    • A provisional data mining model is then constructed using the training samples provided in the training data set.
    • However, the algorithm needs to guard against "memorizing" the training set and blindly applying all patterns found in the training set to the future data. Ex: Just because all people named ‘David’ in the training set are in the high income bracket, it may not be true for all people in general.
    • Therefore, the next step is to examine how the provisional model performs on a test set of data. In the test set the provisional model performs classification according to the patterns and structures it learned from the training set.
    • The efficacy of the classifications is then evaluated by comparing them against the true values of the target variable.
    • The provisional model is then adjusted to minimize the error rate on the test set.
  • We must ensure that the training and test data sets are independent, by validating the partition.
    • By performing graphical and statistical comparisons between the two sets.
    • For example, we may find that, even though the assignment of records was made randomly, a significantly higher proportion of positive values of an important flag variable were assigned to the training set, compared to the test set. This would bias our results.
    • It is especially important that the characteristics of the target variable be as similar as possible between the training and test data sets.
    • Hypothesis tests for validating the target variable, based on the type of target variable: t-test (for difference in means), z-test (for difference in proportions), test for homogeneity of proportions
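These partition-validation checks can be sketched in base R; the data frame, column names, and 70/30 split below are illustrative, not taken from the chapter:

```r
# #Twofold partition with random assignment, then validate the split (sketch)
set.seed(3)
n  <- 10000
dd <- data.frame(age  = rnorm(n, mean = 40, sd = 12),
                 flag = rbinom(n, 1, 0.25))
idx   <- sample(n, size = 0.7 * n)   # 70% of records to training
train <- dd[idx, ]
test  <- dd[-idx, ]
# #t-test for a difference in means of a numeric variable
t.test(train$age, test$age)$p.value
# #Test for a difference in proportions of the flag variable
prop.test(c(sum(train$flag), sum(test$flag)),
          c(nrow(train), nrow(test)))$p.value
```

Large p-values indicate no evidence of a systematic difference between the two sets, so the partition passes validation.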
Definition 38.5 Overfitting results when the provisional model tries to account for every possible trend or structure in the training set, even idiosyncratic ones such as the ‘David’ example above.
  • Increasing the complexity of the model in order to increase the accuracy on the training set eventually and inevitably leads to a degradation in the generalizability of the provisional model to the test set.
    • As the model complexity increases, the error rate on the training set continues to fall in a monotone manner.
    • However, as the model complexity increases, the test set error rate soon begins to flatten out and increase because the provisional model has memorized the training set rather than leaving room for generalizing to unseen data.
  • The low complexity model suffers from some classification errors. The classification errors can be reduced by a more complex model.
    • We might be tempted to adopt the greater complexity in order to reduce the error rate.
    • However, we should be careful not to depend on the idiosyncrasies of the training set.
    • The low-complexity model need not change very much to accommodate new data points, i.e. the low-complexity model has low variance.
    • However, the high-complexity model must alter considerably if it is to maintain its low error rate, i.e. the high-complexity model has a high variance.
Definition 38.6 Even though the high-complexity model has low bias (error rate), it has a high variance; and even though the low-complexity model has a high bias, it has a low variance. This is known as the bias–variance trade-off. The bias–variance trade-off is another way of describing the overfitting–underfitting dilemma.
  • The goal is to construct a model in which neither the bias nor the variance is too high
    • A common method of evaluating how accurate model estimation is proceeding for a continuous target variable is to use the mean-squared error (MSE). (Target: Low MSE)
      • MSE is a good evaluative measure because it combines both bias and variance.
      • The MSE is a function of the estimation error (sum of squared errors, SSE) and the model complexity (e.g., degrees of freedom).
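The claim that MSE combines bias and variance can be checked numerically: for an estimator, MSE = variance + bias². A Monte Carlo sketch using a deliberately biased estimator (0.9 times the sample mean; all numbers illustrative):

```r
# #Numerical check of MSE = variance + bias^2 for an estimator (sketch)
set.seed(4)
mu  <- 10
# #A biased estimator of mu: shrink the sample mean towards zero
est  <- replicate(20000, 0.9 * mean(rnorm(25, mean = mu)))
mse  <- mean((est - mu)^2)
bias <- mean(est) - mu
c(MSE = mse, Bias2_plus_Var = bias^2 + var(est))
```

The two numbers agree up to simulation error, illustrating why a low-bias but high-variance model can still have a poor MSE.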

Validation


References

Daniel T. Larose, Chantal D. Larose. 2015. Data Mining and Predictive Analytics. Second Edition. Danvers, MA 01923 USA: Wiley. https://www.wiley.com.
David R. Anderson, Thomas A. Williams, Dennis J. Sweeney. 2018. Statistics for Business and Economics. Revised 13e. Boston, MA 02210 USA: Cengage Learning. https://www.cengage.com.

Glossary

THEOREMS

DEFINITIONS

1.1: Vectors

Vectors are the simplest type of data structure in R. A vector is a sequence of data elements of the same basic type.

1.2: Components

Members of a vector are called components.

1.3: Packages

Packages are the fundamental units of reproducible R code.

2.1: R-Markdown

R Markdown is a file format for making dynamic documents with R.

2.2: NA

NA is a logical constant of length 1 which contains a missing value indicator.

2.3: Factors

Factors are the data objects which are used to categorize the data and store it as levels.

2.4: Lists

Lists are by far the most flexible data structure in R. They can be seen as a collection of elements without any restriction on the class, length or structure of each element.

2.5: DataFrame

Data Frames are lists with the restriction that all elements of a data frame are of equal length.

7.1: H-Variances

\(\text{(Variances)} \quad {H_0} : {\sigma}_1^2 = {\sigma}_2^2 = \dots = {\sigma}_k^2 \quad \text{vs.} \quad {H_a}: \text{At least two variances differ.}\)

8.1: Imputation

Imputation is the process of replacing missing data with substituted values. Imputation preserves all cases by replacing missing data with an estimated value based on other available information.

11.1: Linear-Regression

Linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables.

16.1: Data

Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.

16.2: Elements

Elements are the entities on which data are collected. (Generally ROWS)

16.3: Variable

A variable is a characteristic of interest for the elements. (Generally COLUMNS)

16.4: Observation

The set of measurements obtained for a particular element is called an observation.

16.5: Statistics

Statistics is the art and science of collecting, analysing, presenting, and interpreting data.

16.6: Scale-of-Measurement

The scale of measurement determines the amount of information contained in the data and indicates the most appropriate data summarization and statistical analyses.

16.7: Nominal-Scale

When the data for a variable consist of labels or names used to identify an attribute of the element, the scale of measurement is considered a nominal scale.

16.8: Ordinal-Scale

The scale of measurement for a variable is considered an ordinal scale if the data exhibit the properties of nominal data and in addition, the order or rank of the data is meaningful.

16.9: Interval-Scale

The scale of measurement for a variable is an interval scale if the data have all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure.

16.10: Ratio-Scale

The scale of measurement for a variable is a ratio scale if the data have all the properties of interval data and the ratio of two values is meaningful.

16.11: Categorical-Data

Data that can be grouped by specific categories are referred to as categorical data. Categorical data use either the nominal or ordinal scale of measurement.

16.12: Quantitative-Data

Data that use numeric values to indicate ‘how much’ or ‘how many’ are referred to as quantitative data. Quantitative data are obtained using either the interval or ratio scale of measurement.

16.13: Discrete

Quantitative data that measure ‘how many’ are discrete.

16.14: Continuous

Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.

16.15: Cross-Sectional-Data

Cross-sectional data are data collected at the same or approximately the same point in time.

16.16: Time-Series-Data

Time-series data are data collected over several time periods.

16.17: Observational-Study

In an observational study we simply observe what is happening in a particular situation, record data on one or more variables of interest, and conduct a statistical analysis of the resulting data.

16.18: Experiment

The key difference between an observational study and an experiment is that an experiment is conducted under controlled conditions.

16.19: Descriptive-Statistics

Most of the statistical information is summarized and presented in a form that is easy to understand. Such summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics.

16.20: Population

A population is the set of all elements of interest in a particular study.

16.21: Sample

A sample is a subset of the population.

16.22: Parameter-vs-Statistic

The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.

16.23: Census

The process of conducting a survey to collect data for the entire population is called a census.

16.24: Sample-Survey

The process of conducting a survey to collect data for a sample is called a sample survey.

16.25: Statistical-Inference

Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.

16.26: Analytics

Analytics is the scientific process of transforming data into insight for making better decisions.

16.27: Descriptive-Analytics

Descriptive analytics encompasses the set of analytical techniques that describe what has happened in the past.

16.28: Predictive-Analytics-301

Predictive analytics consists of analytical techniques that use models constructed from past data to predict the future or to assess the impact of one variable on another.

16.29: Prescriptive-Analytics

Prescriptive analytics is the set of analytical techniques that yield a best course of action.

16.30: Big-Data

Larger and more complex data sets are now often referred to as big data.

16.31: Data-Mining-301

Data Mining deals with methods for developing useful decision-making information from large databases. It can be defined as the automated extraction of predictive information from (large) databases.

17.1: Frequency-Distribution

A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several non-overlapping categories or classes.

17.2: Cross-Tab

A crosstabulation is a tabular summary of data for two variables. It is used to investigate the relationship between them. Generally, one of the variables is categorical.

18.1: Number

A number is a mathematical object used to count, measure, and label. Their study or usage is called arithmetic, a term which may also refer to number theory, the study of the properties of numbers.

18.2: Prime

A prime number is a natural number greater than 1 that is not a product of two smaller natural numbers. A natural number greater than 1 that is not prime is called a ‘composite number.’ 1 is neither a prime nor a composite; it is a ‘unit.’ Thus, by definition, negative integers and zero cannot be prime.

18.3: Parity-Odd-Even

Parity is the property of an integer \(\mathbb{Z}\) of being even or odd. An integer is even if it is divisible by 2 with no remainder, and odd otherwise. Thus, -2, 0, +2 are even but -1, 1 are odd. Numbers ending with 0, 2, 4, 6, 8 are even. Numbers ending with 1, 3, 5, 7, 9 are odd.

18.4: Positive-Negative

An integer \(\mathbb{Z}\) is positive if it is greater than zero, and negative if it is less than zero. Zero is defined as neither negative nor positive.

18.5: Mersenne-Primes

Mersenne primes are those prime numbers that are of the form \((2^n -1)\); that is, \(\{3, 7, 31, 127, \ldots \}\)
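A quick base R check of the listed values; note (an added caution) that not every prime exponent yields a prime, e.g. \(2^{11} - 1 = 2047 = 23 \times 89\):

```r
# #First Mersenne primes: 2^p - 1 for the prime exponents p = 2, 3, 5, 7
2^c(2, 3, 5, 7) - 1
## [1]   3   7  31 127
```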

18.6: Measures-of-Location

Measures of location are numerical summaries that indicate where on a number line a certain characteristic of the variable lies. Examples of the measures of location are percentiles and quantiles.

18.7: Measures-of-Center

The measures of center are a special case of measures of location. These estimate where the center of a particular variable lies. Most common are Mean, Median, and Mode.

18.8: Mean

Given a data set \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({{x}_1, {x}_2, \ldots, {x}_n}\) divided by the count \({n}\).

18.9: Median

Median of a population is any value such that at most half of the population is less than the proposed median and at most half is greater than the proposed median.

18.10: Geometric-Mean

The geometric mean \(\overline{x}_g\) is a measure of location that is calculated by finding the \(n^{th}\) root of the product of \({n}\) values.
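Base R has no built-in geometric mean function; the nth-root-of-product definition is equivalent to exponentiating the mean of the logs, which avoids overflow for long vectors:

```r
# #Geometric mean: nth root of the product of n values
x <- c(1, 4, 16)
prod(x)^(1/length(x))   # direct definition
## [1] 4
exp(mean(log(x)))       # numerically stable equivalent
## [1] 4
```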

18.11: Mode

The mode is the value that occurs with greatest frequency.

18.12: Percentile

A percentile provides information about how the data are spread over the interval from the smallest value to the largest value. For a data set containing \({n}\) observations, the \(p^{th}\) percentile divides the data into two parts: approximately p% of the observations are less than the \(p^{th}\) percentile, and approximately (100 – p)% of the observations are greater than the \(p^{th}\) percentile.
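In R, `quantile()` computes percentiles. By default it uses linear interpolation (`type = 7`), so hand calculations that follow a different percentile rule may differ slightly:

```r
x <- c(15, 20, 25, 25, 27, 28, 30, 34)    # illustrative data
quantile(x, probs = c(0.25, 0.50, 0.90))  # 25th, 50th, 90th percentiles
##   25%   50%   90% 
## 23.75 26.00 31.20
```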

18.13: Measures-of-Spread

Measures of spread (or measures of variability) describe how spread out the data values are. Examples are the range, standard deviation (SD), mean absolute deviation, and interquartile range (IQR).

18.14: Variance

The variance \(({\sigma}^2)\) is based on the difference between the value of each observation \({x_i}\) and the mean (\({\mu}\) for a population, \({\overline{x}}\) for a sample). The average of the squared deviations is called the variance.

18.15: Standard-Deviation

The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.
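Note that base R's `var()` and `sd()` compute the sample versions, dividing by \((n - 1)\) rather than \(n\):

```r
x <- c(46, 54, 42, 46, 32)
var(x)   # sample variance: sum((x - mean(x))^2) / (n - 1)
## [1] 64
sd(x)    # sample standard deviation: sqrt(var(x))
## [1] 8
```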

18.16: Skewness

Skewness \((\tilde{\mu}_{3})\) is a measure of the shape of a data distribution. It is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

18.17: Tails

A tail refers to the tapering sides at either end of a distribution curve.

18.18: Kurtosis

Kurtosis \((\tilde{\mu}_{4})\) is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution. For \({\mathcal{N}}_{(\mu, \, \sigma)}\), kurtosis is 3 and excess kurtosis is 0 (i.e. subtract 3).
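Base R has no skewness or kurtosis functions, but the moment formulas can be applied directly. (Packages such as psych and e1071, attached above, provide variants that differ slightly in their denominators.)

```r
set.seed(1)
x <- rexp(1000)                     # right-skewed sample
m <- mean(x)
s <- sqrt(mean((x - m)^2))          # population-form standard deviation
skewness <- mean((x - m)^3) / s^3             # ~2 for an exponential
excess_kurtosis <- mean((x - m)^4) / s^4 - 3  # ~6 for an exponential
```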

18.19: TheSample

A sample of \({n}\) observations given by \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\) has a sample mean \({\overline{x}}\) and a sample standard deviation \({s}\).

18.20: z-Scores

The z-score, \({z_i}\), associated with each \({x_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). The z-score is often called the standardized value or standard score.
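z-scores can be computed directly from the definition, or with base R's `scale()`:

```r
x <- c(10, 20, 30, 40, 50)
z <- (x - mean(x)) / sd(x)   # equivalent to drop(scale(x))
round(z, 4)
## [1] -1.2649 -0.6325  0.0000  0.6325  1.2649
```

Standardized values always have mean 0 and standard deviation 1.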

18.21: t-statistic

Computing a z-score requires knowing the mean \({\mu}\) and standard deviation \({\sigma}\) of the complete population to which a data point belongs. If one only has a sample of observations from the population, then the analogous computation with sample mean \({\overline{x}}\) and sample standard deviation \({s}\) yields the t-statistic.

18.22: Chebyshev-Theorem

Chebyshev Theorem can be used to make statements about the proportion of data values that must be within a specified number of standard deviations \({\sigma}\), of the mean \({\mu}\).

18.23: Empirical-Rule

Empirical rule is used to compute the percentage of data values that must be within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. These probabilities are approximately 68.27%, 95.45%, and 99.73%.
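These percentages follow directly from the normal CDF:

```r
# P(mu - k*sigma < X < mu + k*sigma) for k = 1, 2, 3 under a normal
round(100 * (pnorm(1:3) - pnorm(-(1:3))), 2)
## [1] 68.27 95.45 99.73
```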

18.24: Outliers

Sometimes unusually large or unusually small values are called outliers. It is a data point that differs significantly from other observations.

18.25: Covariance

Covariance is a measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.

18.26: Correlation-Coefficient

Correlation coefficient is a measure of linear association between two variables that takes on values between -1 and +1. Values near +1 indicate a strong positive linear relationship; values near -1 indicate a strong negative linear relationship; and values near zero indicate the lack of a linear relationship.
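Both measures are available in base R (in their sample forms, with \(n - 1\) in the covariance denominator):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
cov(x, y)   # sample covariance: positive => positive linear association
## [1] 1.5
cor(x, y)   # Pearson correlation coefficient, always in [-1, +1]
## [1] 0.7745967
```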

19.1: Probability

Probability is a numerical measure of the likelihood that an event will occur. Probability values are always assigned on a scale from 0 to 1. A probability near zero indicates an event is unlikely to occur; a probability near 1 indicates an event is almost certain to occur.

19.2: Random-Experiment

A random experiment is a process that generates well-defined experimental outcomes. On any single repetition or trial, the outcome that occurs is determined completely by chance.

19.3: Sample-Space

The sample space for a random experiment is the set of all experimental outcomes.

19.4: Counting-Rule

Counting Rule for Multiple-Step Experiments: If an experiment can be described as a sequence of \({k}\) steps with \({n_1}\) possible outcomes on the first step, \({n_2}\) possible outcomes on the second step, and so on, then the total number of experimental outcomes is given by \(\{(n_1)(n_2) \cdots (n_k) \}\)

19.5: Tree-Diagram

A tree diagram is a graphical representation that helps in visualizing a multiple-step experiment.

19.6: Factorial

The factorial of a non-negative integer \({n}\), denoted by \(n!\), is the product of all positive integers less than or equal to n. The value of 0! is 1 i.e. \(0!=1\)

19.7: Combinations

Combination allows one to count the number of experimental outcomes when the experiment involves selecting \({k}\) objects from a set of \({N}\) objects. The number of combinations of \({N}\) objects taken \({k}\) at a time is equal to the binomial coefficient \(C_k^N\)

19.8: Permutations

Permutation allows one to compute the number of experimental outcomes when \({k}\) objects are to be selected from a set of \({N}\) objects where the order of selection is important. The same \({k}\) objects selected in a different order are considered a different experimental outcome. The number of permutations of \({N}\) objects taken \({k}\) at a time is given by \(P_k^N\)
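Base R provides these counting functions directly; permutations can be obtained from combinations by multiplying by \(k!\):

```r
factorial(5)                  # 5! = 5*4*3*2*1
## [1] 120
choose(10, 3)                 # combinations C(10,3) = 10!/(3! * 7!)
## [1] 120
choose(10, 3) * factorial(3)  # permutations P(10,3) = 10!/7!
## [1] 720
```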

19.9: Event

An event is a collection of sample points. The probability of any event is equal to the sum of the probabilities of the sample points in the event. The sample space, \({S}\), is an event. Because it contains all the experimental outcomes, it has a probability of 1; that is, \(P(S) = 1\)

19.10: Complement

Given an event \({A}\), the complement of A (\(A^c\)) is defined to be the event consisting of all sample points that are not in A. Thus, \(P(A) + P(A^{c}) =1\)

19.11: Union

Given two events A and B, the union of A and B is the event containing all sample points belonging to A or B or both. The union is denoted by \(A \cup B\)

19.12: Intersection

Given two events A and B, the intersection of A and B is the event containing the sample points belonging to both A and B. The intersection is denoted by \(A \cap B\)

19.13: Mutually-Exclusive

Two events are said to be mutually exclusive if the events have no sample points in common. Thus, \(A \cap B = \emptyset\)

19.14: Conditional-Probability

Conditional probability is the probability of an event given that another event has already occurred. The conditional probability of ‘A given B’ is \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)

19.15: Events-Independent

Two events A and B are independent if \(P(A|B) = P(A) \quad \text{OR} \quad P(B|A) = P(B) \Rightarrow P(A \cap B) = P(A) \cdot P(B)\)

20.1: Random-Variable

A random variable is a numerical description of the outcome of an experiment. Random variables must assume numerical values. It can be either ‘discrete’ or ‘continuous.’

20.2: Discrete-Random-Variable

A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. Factor-type variables are included, e.g. Male coded as 0 and Female as 1.

20.3: Continuous-Random-Variable

A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.

20.4: Probability-Distribution

The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.

20.5: Probability-Function

For a discrete random variable x, a probability function \(f(x)\), provides the probability for each value of the random variable.

20.6: Expected-Value-Discrete

The expected value, or mean, of a random variable is a measure of the central location for the random variable. i.e. \(E(x) = \mu = \sum xf(x)\)

20.7: Variance-Discrete

The variance is a weighted average of the squared deviations of a random variable from its mean. The weights are the probabilities. i.e. \(\text{Var}(x) = \sigma^2 = \sum \{(x- \mu)^2 \cdot f(x)\}\)
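For a discrete random variable given as values and probabilities, both summaries are weighted sums. The values and probabilities below are hypothetical:

```r
x <- 0:3                        # possible values
f <- c(0.1, 0.2, 0.4, 0.3)      # hypothetical probability function; sums to 1
mu     <- sum(x * f)            # expected value E(x)
sigma2 <- sum((x - mu)^2 * f)   # variance Var(x)
c(mu = mu, sigma2 = sigma2)
##     mu sigma2 
##   1.90   0.89
```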

20.8: Bivariate

A probability distribution involving two random variables is called a bivariate probability distribution. A discrete bivariate probability distribution provides a probability for each pair of values that may occur for the two random variables.

21.1: Uniform-Probability-Distribution

Uniform probability distribution is a continuous probability distribution for which the probability that the random variable will assume a value in any interval is the same for each interval of equal length. Whenever the probability is proportional to the length of the interval, the random variable is uniformly distributed.

21.2: Probability-Density-Function

The probability that the continuous random variable \({x}\) takes a value between \([a, b]\) is given by the area under the graph of probability density function \(f(x)\); that is, \(A = \int _{a}^{b}f(x)\ dx\). Note that \(f(x)\) can be greater than 1; however, its integral over the entire range must equal 1.
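For example, integrating the standard normal density with base R's `integrate()`:

```r
# Area under dnorm between -1 and 1 (matches the empirical rule's 68.27%)
integrate(dnorm, lower = -1, upper = 1)$value
## [1] 0.6826895
# The total area under any density is 1
integrate(dnorm, lower = -Inf, upper = Inf)$value
## [1] 1
```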

21.3: Normal-Distribution

A normal distribution (\({\mathcal{N}}_{({\mu}, \, {\sigma}^2)}\)) is a type of continuous probability distribution for a real-valued random variable.

21.4: Standard-Normal

A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal{z}}_{({\mu} = 0, \, {\sigma} = 1)}\)
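Base R exposes the standard normal through `dnorm`/`pnorm`/`qnorm` (mean 0 and sd 1 are the defaults):

```r
pnorm(1.96)    # cumulative probability P(z <= 1.96)
## [1] 0.9750021
qnorm(0.975)   # quantile for a cumulative probability of 0.975
## [1] 1.959964
dnorm(0)       # density at the mean, 1/sqrt(2*pi)
## [1] 0.3989423
```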

22.1: Sampled-Population

The sampled population is the population from which the sample is drawn.

22.2: Frame

Frame is a list of the elements that the sample will be selected from.

22.3: Target-Population

The target population is the population we want to make inferences about. Generally (and preferably), it will be the same as the ‘Sampled-Population,’ but the two may differ.

22.4: SRS

A simple random sample (SRS) is a set of \({k}\) objects in a population of \({N}\) objects where all possible samples are equally likely to happen. The number of such different simple random samples is \(C_k^N\)

22.5: Sampling-without-Replacement

Sampling without replacement: Once an element has been included in the sample, it is removed from the population and cannot be selected a second time.

22.6: Sampling-with-Replacement

Sampling with replacement: Once an element has been included in the sample, it is returned to the population. A previously selected element can be selected again and therefore may appear in the sample more than once.

22.7: Random-Sample

A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied: each element selected comes from the same population, and each element is selected independently. The second condition prevents selection bias.

22.8: Proportion

A population proportion \({P}\) is a parameter that describes a percentage value associated with a population. It is given by \(P = \frac{X}{N}\), where \({X}\) is the count of successes in the population and \({N}\) is the size of the population. It is estimated through the sample proportion \(\overline{p} = \frac{x}{n}\), where \({x}\) is the count of successes in the sample and \({n}\) is the size of the sample obtained from the population.

22.9: Point-Estimation

To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.

22.10: Point-Estimator

A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators for the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), and \(\rho_{xy}\) (correlation).

22.11: Point-Estimate

The numerical value obtained for the sample statistic is called the point estimate. ‘Estimate’ refers to the sample value only; the corresponding population value is a parameter. An estimate is a value, while an estimator is a function.

22.12: Sampling-Distribution

The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).

22.13: Standard-Error

In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’

22.14: Sampling-Error

A sampling error is the difference between a population parameter and a sample statistic.

22.15: Central-Limit-Theorem

Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.
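A quick simulated sketch: even for a skewed exponential population (\(\mu = 1\), \(\sigma = 1\)), means of samples of size 30 are approximately normal, centered at \(\mu\) with standard error \(\sigma/\sqrt{n}\).

```r
set.seed(42)
xbar <- replicate(10000, mean(rexp(30, rate = 1)))  # 10,000 sample means
c(mean = mean(xbar), se = sd(xbar))  # close to 1 and 1/sqrt(30) = 0.1826
```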

23.1: Interval-Estimate

Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)

23.2: Confidence-Interval

Confidence interval is another name for an interval estimate. The confidence level is normally given as \(({\gamma} = 1 - {\alpha})\). Ex: 95% confidence interval

23.3: Confidence-Coefficient

The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

23.4: t-distribution

When \({s}\) is used to estimate \({\sigma}\), the margin of error and the interval estimate for the population mean are based on a probability distribution known as the t distribution.

23.5: Degrees-of-Freedom

The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, an estimate that uses one quantity computed from the data (such as the sample mean) has \((n - 1)\) degrees of freedom.

24.1: Hypothesis-Testing

Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.

24.2: Hypothesis-Null

Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.

24.3: Hypothesis-Alternative

Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.

24.4: Hypothesis-1T-Lower-Tail

\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)

24.5: Hypothesis-1T-Upper-Tail

\(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

24.6: Hypothesis-2T-Two-Tail

\(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

24.7: Error-Type-I

The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).

24.8: Error-Type-II

The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).

24.9: Level-of-Significance

The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.

24.10: Significance-Tests

Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.

24.11: Test-Statistic

Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.

24.12: Tailed-Test

A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.

24.13: One-Tailed-Test

One-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution.

24.14: 1s-known-sd

If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (24.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)

24.15: Approach-p-value

The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.

24.16: p-value

A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).

24.17: Approach-Critical-Value

The critical value approach requires that we first determine a value for the test statistic called the critical value.

24.18: Critical-Value

Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).

24.19: Acceptance-Region

An acceptance region (confidence interval) is a set of values of the test statistic for which the null hypothesis is accepted; i.e. if the observed test statistic falls in the confidence interval, then we accept the null hypothesis and reject the alternative hypothesis.

24.20: Margin-Error

The margin of error tells how far the population mean might be from the sample mean. It is given by \(Z\frac{{\sigma}}{\sqrt{n}}\)

24.21: Rejection-Region

A rejection region (critical region) is a set of values of the test statistic for which the null hypothesis is rejected; i.e. if the observed test statistic falls in the critical region, then we reject the null hypothesis and accept the alternative hypothesis.

24.22: Two-Tailed-Test

Two-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution.

24.23: Approach-p-value-Steps

p-value Approach: Form Hypothesis | Specify \({\alpha}\) | Calculate test statistic | Calculate p-value | Compare p-value with \({\alpha}\) | Interpret

24.24: 1s-unknown-sd

If \({\sigma}\) is unknown, the sampling distribution of the test statistic follows the t distribution with \((n - 1)\) degrees of freedom. Refer equation (24.3) \(\to t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}}\)
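A one-sample t test in R, with hypothetical data and \(H_0: \mu = 70\):

```r
x <- c(68, 72, 71, 69, 74, 73, 70, 75)   # hypothetical sample
tt <- t.test(x, mu = 70, alternative = "two.sided")
tt$statistic   # t = (xbar - mu0) / (s / sqrt(n))
tt$parameter   # degrees of freedom: n - 1 = 7
tt$p.value     # compare with alpha to decide on H0
```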

24.25: H-1s-p-Lower

\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {p} \geq {p}_0 \iff {H_a}: {p} < {p}_0\)

24.26: H-1s-p-Upper

\(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)

24.27: H-1s-p-Two

\(\text{\{Two Tail Test \} } \thinspace {H_0} : {p} = {p}_0 \iff {H_a}: {p} \neq {p}_0\)

24.28: Power

The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).

24.29: Power-Curve

Power Curve is a graph of the probability of rejecting \({H_0}\) for all possible values of the population parameter \({\mu}\) not satisfying the null hypothesis. It provides the probability of correctly rejecting the null hypothesis.

25.1: Independent-Simple-Random-Samples

Let \({\mathcal{N}}_{({\mu}_1, \, {\sigma}_1)}\) and \({\mathcal{N}}_{({\mu}_2, \, {\sigma}_2)}\) be the two populations. To make an inference about the difference between the means \(({\mu}_1 - {\mu}_2)\), we select a simple random sample of \({n}_1\) units from population 1 and a second simple random sample of \({n}_2\) units from population 2. The two samples, taken separately and independently, are referred to as independent simple random samples.

25.2: H-2s-Lower

\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)

25.3: H-2s-Upper

\(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

25.4: H-2s-Two

\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

25.5: Shapiro-Wilk-Test

The Shapiro-Wilk test is a test of normality. It tests the null hypothesis that a sample came from a normally distributed population. \(P_{\text{shapiro}} > ({\alpha} = 0.05) \to \text{Data is Normal}\). Avoid using samples with more than 5000 observations.
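In R, `shapiro.test()` implements this; it refuses samples outside the 3–5000 size range (see error 25.1 below):

```r
set.seed(7)
shapiro.test(rexp(100))$p.value   # skewed sample: expect a small p-value
# The sample-size limit: more than 5000 observations raises an error
try(shapiro.test(rnorm(5001)))
## Error in shapiro.test(rnorm(5001)) : sample size must be between 3 and 5000
```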

25.6: Independent-Sample-Design-Example

Independent sample design: A simple random sample of workers is selected and each worker in the sample uses method 1. A second independent simple random sample of workers is selected and each worker in this sample uses method 2.

25.7: Matched-Sample-Design-Example

Matched sample design: One simple random sample of workers is selected. Each worker first uses one method and then uses the other method. The order of the two methods is assigned randomly to the workers, with some workers performing method 1 first and others performing method 2 first. Each worker provides a pair of data values, one value for method 1 and another value for method 2.

25.8: Hypo-Paired-Two

\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_d = 0 \iff {H_a}: {\mu}_d \neq 0\)

25.9: H-2s-p-Lower

\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {p}_1 - {p}_2 \geq 0 \iff {H_a}: {p}_1 - {p}_2 < 0\)

25.10: H-2s-p-Upper

\(\text{\{Right or Upper\} } {H_0} : {p}_1 - {p}_2 \leq 0 \iff {H_a}: {p}_1 - {p}_2 > 0\)

25.11: H-2s-p-Two

\(\text{\{Two Tail Test \} } \thinspace {H_0} : {p}_1 - {p}_2 = 0 \iff {H_a}: {p}_1 - {p}_2 \neq 0\)

26.1: Distribution-Chi-Square

Whenever a simple random sample of size \({n}\) is selected from a normal population, the sampling distribution of \(\frac{(n-1)s^2}{{\sigma}^2}\) is a chi-square distribution with \({n - 1}\) degrees of freedom.

26.2: H-1s-Var-Lower

\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\sigma}^2 \geq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 < {{\sigma}_0^2}\)

26.3: H-1s-Var-Upper

\(\text{\{Right or Upper\} } {H_0} : {\sigma}^2 \leq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 > {{\sigma}_0^2}\)

26.4: H-1s-Var-Two

\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}^2 = {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 \neq {{\sigma}_0^2}\)

26.5: Distribution-F

Whenever independent simple random samples of sizes \({n}_1\) and \({n}_2\) are selected from two normal populations with equal variances \(({\sigma}_1^2 = {\sigma}_2^2)\), the sampling distribution of \(\frac{{s}_1^2}{{s}_2^2}\) is an F distribution with \(({n}_1 - 1)\) degrees of freedom for the numerator and \(({n}_2 - 1)\) degrees of freedom for the denominator.

26.6: H-2s-Var-Lower

\(\text{\{Left or Lower \} }\space\thinspace \text{Do not do this.}\)

26.7: H-2s-Var-Upper

\(\text{\{Right or Upper\} } {H_0} : {\sigma}_1^2 \leq {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 > {\sigma}_2^2\)

26.8: H-2s-Var-Two

\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}_1^2 = {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 \neq {\sigma}_2^2\)

27.1: H-3p

\(\text{\{Equality of Population Proportions \}} {H_0} : {p}_1 = {p}_2 = \dots = {p}_k \iff {H_a}: \text{Not all population proportions are equal}\)

28.1: Randomization

Randomization is the process of assigning the treatments to the experimental units at random.

28.2: H-ANOVA

\(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)

29.1: Regression-Analysis

Regression analysis can be used to develop an equation showing how two or more variables are related.

29.2: Variable-Dependent

The variable being predicted is called the dependent variable \(({y})\).

29.3: Variable-Independent

The variable or variables being used to predict the value of the dependent variable are called the independent variables \(({x})\).

29.4: Simple-Linear-Regression

The simplest type of regression analysis involving one independent variable and one dependent variable in which the relationship between the variables is approximated by a straight line, is called simple linear regression.

29.5: Regression-Model

The equation that describes how \({y}\) is related to \({x}\) and an error term is called the regression model. For example, simple linear regression model is given by equation (29.1)

33.1: Parametric-Methods

Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).

33.2: Distribution-free-Methods

Distribution-free methods are the Statistical methods that make no assumption about the probability distribution of the population.

33.3: Nonparametric-Methods

Nonparametric methods are the statistical methods that require no assumption about the form of the probability distribution of the population and are often referred to as distribution free methods. Several of the methods can be applied with categorical as well as quantitative data.

34.1: Data-Mining-331

Data mining is the process of discovering useful patterns and trends in large data sets.

34.2: Predictive-Analytics-331

Predictive analytics is the process of extracting information from large data sets in order to make predictions and estimates about future outcomes.

35.1: Variable-Flag-Dummy

A flag variable (or dummy variable, or indicator variable) is a categorical variable taking only two values, 0 and 1. Ex: Gender (Male, Female) can be recoded into dummy Gender (Male = 0, Female = 1).

37.1: Multicollinearity

Multicollinearity is a condition where some of the predictor variables are strongly correlated with each other.

37.2: Principle-of-Parsimony

Principle of parsimony is the problem-solving principle that “entities should not be multiplied beyond necessity.”

37.3: Overfitting

Overfitting is the production of an analysis that corresponds too closely to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.

37.4: Underfitting

Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data.

37.5: PCA

Principal components analysis (PCA) seeks to explain the correlation structure of a set of predictor variables \({m}\), using a smaller set of linear combinations of these variables, called components \({k}\). PCA acts solely on the predictor variables, and ignores the target variable.
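A minimal sketch with `prcomp()` on the built-in mtcars data, standardizing first since PCA is sensitive to variable scales:

```r
# PCA acts on the predictors only; no target variable is involved
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
# Share of total variance explained by the first three components
summary(pca)$importance["Cumulative Proportion", 1:3]
```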

37.6: Eigenvalues

Let \(\mathbf{B}\) be an \(m \times m\) matrix, and let \(\mathbf{I}\) be the \(m \times m\) identity matrix. Then the scalars \(\{\lambda_1, \lambda_2, \ldots, \lambda_m\}\) are said to be the eigenvalues of \(\mathbf{B}\) if they satisfy \(|\mathbf{B} - \lambda \mathbf{I}| = 0\), where \(|\mathbf{Q}|\) denotes the determinant of Q.

37.7: Eigenvector

Let \(\mathbf{B}\) be an \(m \times m\) matrix, and let \({\lambda}\) be an eigenvalue of \(\mathbf{B}\). Then a nonzero \(m \times 1\) vector \(\overrightarrow{e}\) is said to be an eigenvector of \(\mathbf{B}\) if \(\mathbf{B} \overrightarrow{e} = \lambda \overrightarrow{e}\).
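Base R's `eigen()` returns both eigenvalues and eigenvectors; for a symmetric matrix the eigenvectors are also orthogonal:

```r
B <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)  # symmetric 2x2 example
e <- eigen(B)
e$values    # eigenvalues, largest first
## [1] 3 1
# B e = lambda e for each column: the difference below is ~ the zero matrix
B %*% e$vectors - e$vectors %*% diag(e$values)
```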

37.8: Orthogonal

Two vectors are orthogonal if they are mathematically independent, have no correlation, and are at right angles to each other.

37.9: Communality

PCA does not extract all the variance from the variables, but only that proportion of the variance that is shared by several variables. Communality represents the proportion of variance of a particular variable that is shared with other variables. Communality values are calculated as the sum of squared component weights, for a given variable.

38.1: Unsupervised-Methods

In unsupervised methods, no target variable is identified as such. Instead, the data mining algorithm searches for patterns and structures among all the variables. The most common unsupervised data mining method is clustering. Ex: Voter Profile.

38.2: Supervised-Methods

Supervised methods are those in which there is a particular prespecified target variable and the algorithm is given many examples where the value of the target variable is provided. This allows the algorithm to learn which values of the target variable are associated with which values of the predictor variables.

38.3: A-Priori-Hypothesis

An a priori hypothesis is one that is generated prior to a research study taking place.

38.4: Cross-Validation

Cross-validation is a technique for ensuring that the results uncovered in an analysis are generalizable to an independent, unseen data set.

38.5: Overfitting-c37

Overfitting results when the provisional model tries to account for every possible trend or structure in the training set, even idiosyncratic ones such as the ‘David’ example above.

Bias–variance-Trade-off

Even though the high-complexity model has low bias (error rate), it has a high variance; and even though the low-complexity model has a high bias, it has a low variance. This is known as the bias–variance trade-off. The bias–variance trade-off is another way of describing the overfitting-underfitting dilemma.

ERRORS

1.1: cannot-open-connection

Error in file(file, ifelse(append, "a", "w")) : cannot open the connection

1.2: need-finite-xlim

Error in plot.window(...) : need finite ’xlim’ values

1.3: par-old-par

Error in par(old.par) : invalid value specified for graphical parameter "pin"

2.1: plot-finite-xlim

Error in plot.window(...) : need finite ’xlim’ values

2.2: Function-Not-Found

Error in arrange(bb, day) : could not find function "arrange"

3.1: Object-Not-Found-01

Error in match.arg(method) : object ’day’ not found

3.2: Comparison-possible

Error in day == 1 : comparison (1) is possible only for atomic and list types

3.3: UseMethod-No-applicable-method

Error in UseMethod("select") : no applicable method for ’select’ applied to an object of class "function"

3.4: Object-Not-Found-02

Error: Problem with mutate() column ... column object ’arr_delay’ not found

17.1: gg-stat-count-geom-bar

Error: stat_count() can only have an x or y aesthetic.

21.1: ggplot-list

Error in is.finite(x) : default method not implemented for type ’list’

21.2: ggplot-data

Error: Must subset the data pronoun with a string.

25.1: shapiro-limit

Error in shapiro.test(...) : sample size must be between 3 and 5000

25.2: t-test-grouping

Error in t.test.formula() : grouping factor must have exactly 2 levels

35.1: Insufficient-Data

Error: Insufficient data values to produce ... bins.

36.1: stat-count-xy

Error: stat_count() can only have an x or y aesthetic.

36.2: stat-count-y

Error: stat_count() must not be used with a y aesthetic.

37.1: CorMat

Error in if (prod(R2) < 0) : missing value where TRUE/FALSE needed